Abstract

Reinforcement learning (RL) is widely used in applications where one needs to perform
sequential decision-making while interacting with the environment. The standard RL
problem with safety constraints is generally modeled as a constrained Markov Decision
Process (CMDP), whose objective and constraints are linear in the occupancy measure
space; the problem becomes challenging when the model is unknown a priori. It becomes
more challenging still when the decision requirement includes optimizing a concave
utility while satisfying nonlinear safety constraints. To solve such a nonlinear problem,
we propose a conservative stochastic primal-dual algorithm (CSPDA) via a randomized
primal-dual approach. By leveraging a generative model, we prove that CSPDA not only
exhibits $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity, but also achieves zero
constraint violations for the concave utility CMDP. In previous works, the best available
sample complexity for CMDP with zero constraint violation is
$\tilde{\mathcal{O}}(1/\epsilon^5)$. Hence, the proposed algorithm provides a significant
improvement over the state of the art.


1. Introduction 


Reinforcement learning (RL) is a machine learning framework that learns to perform a task 
by repeatedly interacting with the environment. This framework is widely utilized in a 
wide range of applications such as robotics, communications, computer vision, autonomous 
driving, etc. (Arulkumaran et al., 2017; Kiran et al., 2021; Al-Abbasi et al., 2019; Geng 
et al., 2020; Chen et al., 2021a). The problem is mathematically formulated as a Markov 
Decision Process (MDP) which constitutes a state, action, and transition probabilities of 
going from one state to the other after taking a particular action. On taking an action, a 
reward is achieved and the overall objective is to maximize the sum of discounted rewards. 
However, in various realistic environments, the agent needs to choose actions such that certain
constraints are satisfied (e.g., average power constraints in wireless sensor networks
(Buratti et al., 2009), queue stability constraints (Xiang et al., 2015), and safe exploration
(Moldovan & Abbeel, 2012)). The standard MDP equipped with cost functions
for the constraints is called a constrained Markov Decision Process (CMDP) (Altman, 1999).
It is well known that the CMDP problem can be equivalently written as a linear program
(LP) in the occupancy measure space (Altman, 1999), where the objective and constraints are
linear with respect to the occupancy measure. But many applications demand more general
non-linear objectives and constraints in terms of the occupancy measure, e.g., risk-sensitive
constraints/objectives (Mihatsch & Neuneier, 2002), maximizing the entropy of the state-action
distribution (Hazan et al., 2019), imitation learning (Ho & Ermon, 2016), and fairness in
multi-agent resource allocation (Margolies et al., 2014). In this work, we consider a novel
MDP with a concave objective and convex constraints and call it CCMDP (concave CMDP).
We remark here that CCMDP is still a constrained convex optimization problem, so it can
be efficiently solved by any existing method from the constrained optimization literature.
The main issue is that doing so requires access to the transition probabilities
of the environment, which are not available in realistic model-free settings.
Hence, efficient model-free algorithms for CCMDP are required.
Before moving forward, we provide a motivating example. For more examples, one
may refer to (Zhang et al., 2020).


Example 1. (Maximizing Entropy) (Hazan et al., 2019) A fundamental problem in reinforcement
learning is that of exploring the state space. How do we understand what is even
possible in the context of a given environment in the absence of a reward signal? Such
a problem is useful in realistic settings since reward functions may be poorly specified or
sparse. A possible quantity of interest is the entropy of the induced distribution, since such
an objective encourages the agent to explore the MDP uniformly. The entropy maximization
problem is formally defined as


$$\max_{\pi} \; -\sum_{s} \lambda_s^{\pi} \log[\lambda_s^{\pi}] \qquad (1)$$


where $\lambda_s^{\pi} = \sum_{a} (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(s_t = s, a_t = a)$ is the normalized occupancy measure.


Remark 1. It is well known that the entropy is a concave function, so it satisfies
Assumption 1. However, to make the example also satisfy Assumption 2, one may
define a shifted function as $f(\lambda) = -\sum_{s} (\lambda_s + c) \log(\lambda_s + c)$, where $c > 0$ is a positive shift
parameter. The gradient of the shifted function is bounded on the simplex, so the Lipschitz property is guaranteed.
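
To make Remark 1 concrete, the following minimal Python sketch evaluates the entropy objective of Example 1 and its shifted variant on a four-state distribution. The function names and the shift value c = 0.1 are illustrative assumptions, not part of the original work.

```python
import math

def entropy(lam):
    """Entropy -sum_s lam_s log(lam_s), with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in lam if p > 0.0)

def shifted_entropy(lam, c):
    """Shifted objective f(lam) = -sum_s (lam_s + c) log(lam_s + c) for c > 0."""
    return -sum((p + c) * math.log(p + c) for p in lam)

def shifted_entropy_grad(lam, c):
    """Entry-wise gradient of the shifted objective: -(log(lam_s + c) + 1)."""
    return [-(math.log(p + c) + 1.0) for p in lam]

c = 0.1  # hypothetical shift value, chosen only for illustration
uniform = [0.25, 0.25, 0.25, 0.25]
vertex = [1.0, 0.0, 0.0, 0.0]  # a corner of the simplex

# Both objectives prefer the uniform distribution to the corner.
print(entropy(uniform), entropy(vertex))
print(shifted_entropy(uniform, c), shifted_entropy(vertex, c))

# With c > 0 the gradient stays finite even at the corner, where the
# gradient of the unshifted entropy would involve log(0).
print(max(abs(g) for g in shifted_entropy_grad(vertex, c)))
```

On the simplex every entry satisfies $c \le \lambda_s + c \le 1 + c$, so each gradient entry is bounded in absolute value by $\max\{|\log c + 1|, \log(1+c) + 1\}$, which is what makes the shifted objective Lipschitz.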


To solve the CMDP problem in a model-free manner, i.e., without a priori knowledge of
the transition probabilities, various algorithms have been proposed in the literature (see Table 1
for comparisons). The performance of these algorithms is measured by the number of
samples (state-action-state transitions) required to achieve an $\epsilon$-optimal (objective
sub-optimality) and $\epsilon$-feasible (constraint violation) policy. An $\epsilon$-feasible policy means that
the constraints may still be violated by up to $\epsilon$ by the obtained policy. However, in many
applications, such as power systems (Vu et al., 2021) or autonomous vehicle control (Wen
et al., 2020), constraint violations could be catastrophic in practice. Hence, achieving
optimal objective guarantees without constraint violation is an important problem and is
the focus of this paper. More precisely, we ask the question, "Is it possible to achieve the
optimal sublinear convergence rate for the objective while achieving zero constraint violations
for the CCMDP problem without a priori knowledge of the transition probabilities?"


We answer the above question in the affirmative in this work. We remark that the sample 
complexity result in this work exhibits tight dependencies on the cardinality of state and 
action spaces (cf. Table 1). The key contributions can be summarized as follows: 


• To the best of our knowledge, this work is the first attempt to provide a model-free algorithm
for CCMDPs that achieves optimal sample complexity with zero constraint violation.
There exists one exception (for the special case of CMDP) in the literature, which
achieves zero constraint violation but at the cost of $\tilde{\mathcal{O}}(1/\epsilon^5)$ sample complexity
to obtain an $\epsilon$-optimal policy (Wei, Liu, & Ying, 2021). In contrast, we are able to
achieve zero constraint violation with $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity.


• This is the first attempt that provides a model-free algorithm for CCMDPs. The key
challenge in solving CCMDP is the formulation of an unbiased estimator for the
Lagrangian function. A straightforward estimator following previous work (Bai, Bedi,
Agarwal, Koppel, & Aggarwal, 2022b) leads to a biased estimator and makes the
analysis challenging (see Remark 3 for details).


• We utilize the idea of conservative constraints to obtain zero constraint violation.
Such an idea was used recently to show zero constraint violation in online
constrained convex optimization (Akhtar et al., 2021). However, the CCMDP problem
is more challenging than online constrained optimization because (1) how to
obtain an unbiased estimator is unknown, and (2) following the same idea only
yields zero violation in the occupancy measure domain (see Theorem 2), while zero
violation in the policy domain is required. Theorem 5.3 is then used to derive such
results, utilizing a novel analysis unique to this work.


• The adaptive state-action pair sampling in the proposed approach would lead to a
high dependence on the cardinality of the state and action spaces if the standard stochastic
optimization analysis were directly applied (see Remark 4 for details). To match the
lower bound, we use the KL divergence as the regularizer for the dual update, which is
similar to (Zhang et al., 2021).


• To provide empirical evidence, we solve a queuing system problem in Sec. 6 and
show the efficacy of the proposed algorithm.


2. Related Works 


In this section, we list the related works in the model-free constrained RL and concave utility
RL fields. For other works, please refer to Table 1.

Setting | Algorithm | Sample Complexity | Constraint violation | Generative Model
Model-Based | OptDual-CMDP (Efroni et al., 2020)^1 | [illegible] | $O(\epsilon)$ | No
Model-Based | OptPrimalDual-CMDP (Efroni et al., 2020)^1 | [illegible] | $O(\epsilon)$ | No
Model-Based | UC-CFH (Kalagarla et al., 2021)^2 | [illegible] | $O(\epsilon)$ | No
Model-Based | CONRL (Brantley et al., 2020) | [illegible] | $O(\epsilon)$ | No
Model-Based | OptPess-PrimalDual (Liu et al., 2021a) | [illegible] | [illegible] | No
Model-Based | OPDOP (Ding et al., 2021)[Theorem 2] | [illegible] | $O(\epsilon)$ | No
Model-Based | UCBVI-$\gamma$ (He et al., 2021)[Theorem 4.3] | [illegible] | N/A | No
Model-Free | NPG-PD (Ding et al., 2020)[Theorem 4]^3 | [illegible] | $O(\epsilon)$ | Yes
Model-Free | CRPO (Xu et al., 2021)^4 | [illegible] | $O(\epsilon)$ | Yes
Model-Free | PDSC (Chen et al., 2021b)^5 | [illegible] | $O(\epsilon)$ | Yes
Model-Free | Triple-Q (Wei et al., 2021) | [illegible] | Zero | No
Model-Free | Randomized Primal-Dual (Wang, 2020) | [illegible] | N/A | Yes
Model-Free | CSPDA (This work, Theorem 3)^6 | [illegible] | Zero | Yes
Lower bound | (Lattimore & Hutter, 2012) and (Azar et al., 2013) | [illegible] | N/A | N/A
Lower bound | (Vaswani et al., 2022) | [illegible] | Zero | N/A


Table 1: Summary of the state-of-the-art model-based and model-free
algorithms available in the literature for CMDPs, where $\varphi$ is the Slater variable
in Assumption 3. Note that the lower bounds for the zero constraint
violation setting and for the unconstrained problem are different. The proposed
algorithm achieves the best sample complexity among all model-free
approaches that require a generative model, and it achieves zero constraint violation
at the same time. For works that consider a different setting, such as the episodic
setting, we provide a detailed method to convert their results to sample
complexity in the infinite-horizon setup in Appendix A.1.


1. (Efroni et al., 2020) used $N$, the maximum number of non-zero transition probabilities across
all state-action pairs. We bound it by $S$. Moreover, a factor depending on $|A|$ appears to be missing
in their result, which we believe is a typo in their work.

2. (Kalagarla et al., 2021) used $C$, the upper bound on the number of possible successor states for
a state-action pair. We bound it by $S$.

3. We use the result in Theorem 4 of (Ding et al., 2020). Notice that in Algorithm 2 of their paper,
additional samples are necessary in each outer loop.

4. Notice that in line 4 of Algorithm 1 in (Xu et al., 2021), an inner loop with $K_{in}$ iterations is
needed for policy evaluation.

5. The dependence on $S$, $A$ is not clear in (Chen et al., 2021b). An estimate of the Q-function is needed
in the algorithm. However, the authors did not include an analysis of this estimation.

6. Notice that the value function defined in this paper is a normalized version. Thus, an extra factor
of $1/(1-\gamma)^2$ is needed for a fair comparison.


Model-free CRL. Compared to model-based algorithms, existing results for
model-free algorithms are fewer. The constrained policy optimization (CPO) algorithm was
proposed in (Achiam et al., 2017) and the reward constrained policy optimization (RCPO)
algorithm was proposed in (Tessler et al., 2018). Moreover, (Gattami et al., 2021) related
CMDPs to zero-sum Markov-Bandit games and provided efficient solutions for CMDPs. However,
these works did not provide any convergence rates for their algorithms. Furthermore,
the authors in (Ding et al., 2020) proposed a primal-dual natural policy gradient algorithm
in both tabular and general settings and provided a regret and constraint violation
analysis. A primal-only constraint-rectified policy optimization (CRPO) algorithm was
proposed in (Xu et al., 2021), which achieves a sublinear convergence rate to the globally optimal
policy and a sublinear convergence rate for the constraint violations as well. Most of the
existing approaches with explicit sample complexity and constraint violation bounds
are summarized in Table 1. Recently, (Chen et al., 2021b) translated the constrained RL
problem into a saddle point problem and proposed a primal-dual algorithm that achieves
$O(1/\epsilon^2)$ sample complexity to obtain an $\epsilon$-optimal, $\epsilon$-feasible solution. However, the policy
is treated as the primal variable in the algorithm and an estimate of the Q-table is required
in the primal update, which introduces extra sample and computational
complexity.


Concave Utility RL. Another major research area related to constrained RL is concave 
utility RL. A special case of maximizing the entropy is considered in (Hazan et al., 2019). 
(Kostrikov et al., 2019) considered a KL-divergence minimization for imitation learning. 
(Bai et al., 2022a; Brantley et al., 2020; Agarwal et al., 2022a; Agarwal & Aggarwal, 2023) 
considered a concave function of possibly vector rewards. Among these works, (Brantley 
et al., 2020; Agarwal et al., 2022a; Agarwal & Aggarwal, 2023) proposed a model-based 
approach and (Bai et al., 2022a) proposed a model-free policy gradient algorithm. (Zhang 
et al., 2020, 2021; Ying et al., 2023) and this work considered a more general setting, where 
the objective function is a concave function of the occupancy measure. However, none of the
other works targeted zero constraint violation. Recently, (Agarwal et al., 2022b) proposed
model-based algorithms based on optimism and posterior sampling approaches that
achieve zero constraint violations. In contrast, our work considers a model-free approach.


3. Problem Formulation 


An infinite-horizon discounted constrained Markov Decision Process (CMDP) is
defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, g^i, I, \gamma, \rho)$. In this model, $\mathcal{S}$ denotes the finite state space
(with $|\mathcal{S}|$ states), $\mathcal{A}$ is the finite action space (with $|\mathcal{A}|$ actions), and
$P : \mathcal{S} \times \mathcal{A} \to \Delta^{|\mathcal{S}|}$ gives the transition dynamics of the CMDP (where $\Delta^{d}$ denotes the
probability simplex in $d$ dimensions). More specifically, $P(\cdot|s,a)$ describes the probability
distribution of the next state conditioned on the current state $s$ and action $a$. We denote
$P(s'|s,a)$ as $P_a(s,s')$ for simplicity. In the CMDP tuple, $r : \mathcal{S} \times \mathcal{A} \to [0,1]$ is the reward
function, $g^i : \mathcal{S} \times \mathcal{A} \to [-1,1]$ is the $i$-th constraint cost function, and $I$ denotes the number
of constraints. Further, $\gamma$ is the discount factor and $\rho$ is the initial distribution of the
states.


BAI, BEDI, AGARWAL, KOPPEL, & AGGARWAL


Let us define the stationary stochastic policy as $\pi : \mathcal{S} \to \Delta^{|\mathcal{A}|}$, which maps a state to a
distribution over the action space. The value functions for the reward and the constraint costs
under such a policy $\pi$ are given by (Chen et al., 2021b)

$$V_r^{\pi}(s) = (1-\gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \,\Big|\, s_0 = s\Big], \qquad V_{g^i}^{\pi}(s) = (1-\gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t g^i(s_t,a_t) \,\Big|\, s_0 = s\Big] \qquad (2)$$

for all $s \in \mathcal{S}$. At each instant $t$, for the given state $s_t$ and action $a_t \sim \pi(\cdot|s_t)$, the next state $s_{t+1}$
is distributed as $s_{t+1} \sim P(\cdot|s_t,a_t)$. The expectation in (2) is with respect to the transition
dynamics of the environment and the stochastic policy $\pi$. The standard CMDP problem
maximizes the value function for the reward while satisfying constraints
on the value functions for the costs, given by


$$\max_{\pi} \; V_r^{\pi}(s) \quad \text{s.t.} \quad V_{g^i}^{\pi}(s) \ge 0 \quad \forall i \in [I]. \qquad (3)$$


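Next, let us define $\lambda^{\pi} : \mathcal{S} \times \mathcal{A} \to [0,1]$, known as the cumulative discounted occupancy measure
under policy $\pi$, given by


$$\lambda^{\pi}(s,a) = (1-\gamma)\Big(\sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(s_t = s, a_t = a)\Big), \qquad (4)$$


where $s_0 \sim \rho$, $a_t \sim \pi(\cdot|s_t)$, and $\mathbb{P}(s_t = s, a_t = a)$ is the probability of visiting state $s$ and
taking action $a$ at step $t$. Then, the problem in (3), which optimizes over the policy space,
can be equivalently written in the occupancy measure space (Zhang et al., 2021) (Altman,
1999)[Theorem 3.3] as


$$\max_{\lambda \ge 0} \; \lambda^{\top} r \quad \text{s.t.} \quad \lambda^{\top} g^i \ge 0 \;\; \forall i \in [I], \qquad \sum_{a \in \mathcal{A}} (\mathbf{I} - \gamma P_a^{\top}) \lambda_a = (1-\gamma)\rho. \qquad (5)$$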
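
As an illustration of Eq. (4) and the equality constraint in (5), the following Python sketch computes a truncated occupancy measure for a small tabular MDP and checks the flow constraint numerically. The two-state MDP, the policy, the truncation horizon, and all function names are hypothetical choices made only for this example.

```python
def occupancy_measure(P, pi, rho, gamma, horizon=2000):
    """Truncated Eq. (4): lam(s,a) = (1-gamma) * sum_t gamma^t * P(s_t=s, a_t=a)."""
    nS, nA = len(rho), len(pi[0])
    lam = [[0.0] * nA for _ in range(nS)]
    d = list(rho)   # state distribution at step t
    disc = 1.0      # gamma^t
    for _ in range(horizon):
        nxt = [0.0] * nS
        for s in range(nS):
            for a in range(nA):
                p_sa = d[s] * pi[s][a]  # P(s_t = s, a_t = a)
                lam[s][a] += (1.0 - gamma) * disc * p_sa
                for s2 in range(nS):
                    nxt[s2] += p_sa * P[s][a][s2]
        d = nxt
        disc *= gamma
    return lam

def flow_residual(lam, P, rho, gamma):
    """Entries of sum_a (I - gamma P_a^T) lam_a - (1 - gamma) rho."""
    nS, nA = len(rho), len(lam[0])
    res = []
    for s2 in range(nS):
        inflow = sum(lam[s][a] * P[s][a][s2] for s in range(nS) for a in range(nA))
        res.append(sum(lam[s2]) - gamma * inflow - (1.0 - gamma) * rho[s2])
    return res

# Hypothetical MDP: P[s][a][s2] = P(s2 | s, a), two states and two actions.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
pi = [[0.5, 0.5], [0.3, 0.7]]   # pi[s][a]
rho = [1.0, 0.0]
gamma = 0.9

lam = occupancy_measure(P, pi, rho, gamma)
print(sum(sum(row) for row in lam))        # total mass, close to 1
print(flow_residual(lam, P, rho, gamma))   # close to the zero vector
```

The residual is not exactly zero only because the infinite sum in (4) is truncated; the remaining error decays like $\gamma^{T}$ in the horizon $T$.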


We note that in (5), the objective and constraints are linear with respect to $\lambda$. In this work,
we are interested in non-linear objectives (concave) and non-linear constraints (convex), which
arise frequently in the literature, for instance in maximizing the entropy of the state-action
distribution (Hazan et al., 2019), imitation learning (Ho & Ermon, 2016), and fairness in
multi-agent resource allocation (Margolies et al., 2014). The concave utility constrained
optimization problem can be formulated as


$$\max_{\lambda \ge 0} \; f(\lambda) \quad \text{s.t.} \quad h^i(\lambda) \ge 0 \;\; \forall i \in [I], \qquad \sum_{a \in \mathcal{A}} (\mathbf{I} - \gamma P_a^{\top}) \lambda_a = (1-\gamma)\rho, \qquad (6)$$


where $f$ is a known concave objective and $h^i$, $i \in [I]$, are the constraint functions.
In (6), we define $\lambda_a = [\lambda(1,a), \cdots, \lambda(|\mathcal{S}|,a)] \in \mathbb{R}^{|\mathcal{S}|}$ as the $a$-th column of $\lambda$. Notice that
the equality constraint in Eq. (6) implies that the entries of $\lambda$ sum to 1, which means $\lambda$ is a valid probability measure,
and we define $\Lambda := \{\lambda \mid \sum_{s,a} \lambda(s,a) = 1\}$ as the probability simplex. For a given occupancy
measure $\lambda$, we can recover the policy $\pi_{\lambda}$ as


$$\pi_{\lambda}(a|s) = \frac{\lambda(s,a)}{\sum_{a'} \lambda(s,a')}. \qquad (7)$$


Using Theorem 3.3(c) in (Altman, 1999), we have that if $\lambda^*$ is the optimal solution of the
problem in Eq. (6), then $\pi_{\lambda^*}$ is the corresponding optimal policy.
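
The policy recovery in Eq. (7) is a row-wise normalization of the occupancy measure. A minimal Python sketch follows; the function name, the example values, and the uniform fallback for states with zero occupancy are assumptions of this sketch, since Eq. (7) leaves that case undefined.

```python
def policy_from_occupancy(lam):
    """Apply Eq. (7) row by row: pi(a|s) = lam(s,a) / sum_a' lam(s,a')."""
    policy = []
    for row in lam:
        mass = sum(row)
        if mass > 0.0:
            policy.append([x / mass for x in row])
        else:
            # Eq. (7) is undefined for states that are never visited; the
            # uniform fallback here is an assumption of this sketch.
            policy.append([1.0 / len(row)] * len(row))
    return policy

# Hypothetical occupancy measure over 3 states and 2 actions (sums to 1).
lam = [[0.10, 0.30],
       [0.00, 0.00],
       [0.45, 0.15]]
pi = policy_from_occupancy(lam)
for s, row in enumerate(pi):
    print(s, row)
```

Each row of the result is a distribution over actions, so the output can be used directly as a stationary stochastic policy.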


4. Algorithm Development 


Before developing the algorithm, we first describe some assumptions and demonstrate some 
properties of the objective function and constraint functions in (6). 


Assumption 1. (Concavity) The objective function $f$ and the constraint functions $h^i$, $i \in [I]$,
are concave functions with respect to the occupancy measure $\lambda$ on the set $\Lambda$.


Assumption 2. (Lipschitz) The objective function $f$ and the constraint functions $h^i$, $i \in [I]$, are
Lipschitz functions with Lipschitz constants $L_f$ and $L_h$ with respect to the occupancy measure
$\lambda$ on the set $\Lambda$. For simplicity, we assume $L_f \ge 1$ and $L_h \ge 1$ (i.e., use $L_f' = \max\{L_f, 1\}$).
Formally, for any $\lambda, \lambda' \in \Lambda$,


$$|f(\lambda) - f(\lambda')| \le L_f \|\lambda - \lambda'\|_2 \qquad (8)$$
$$|h^i(\lambda) - h^i(\lambda')| \le L_h \|\lambda - \lambda'\|_2 \quad \forall i \in [I] \qquad (9)$$

Under Assumptions 1 and 2, we derive the following lemmas.


Lemma 1. (Shalev-Shwartz et al., 2011)[Lemma 2.6] The gradients of the objective function
and the constraint functions are bounded by their Lipschitz constants on the set $\Lambda$. Formally,

$$\|\nabla_{\lambda} f(\lambda)\|_2 \le L_f \quad \forall \lambda \in \Lambda, \qquad \|\nabla_{\lambda} h^i(\lambda)\|_2 \le L_h \quad \forall \lambda \in \Lambda, \; \forall i \in [I].$$


Lemma 2. The objective function and the constraint functions are each bounded by a constant on
the set $\Lambda$. Without loss of generality, we assume they are bounded by 1.


Proof. Define $\bar{\lambda} = \frac{1}{|\mathcal{S}||\mathcal{A}|} e$, where $e$ is the all-ones vector. By Assumption 2, we have for any $\lambda \in \Lambda$


$$|f(\lambda) - f(\bar{\lambda})| \le L_f \|\lambda - \bar{\lambda}\|_2 \le L_f \sqrt{|\mathcal{A}||\mathcal{S}|}.$$

Thus, we can write $|f(\lambda)| \le L_f \sqrt{|\mathcal{A}||\mathcal{S}|} + |f(\bar{\lambda})|$.


Assumption 3. (Strict feasibility) There exists a strictly feasible occupancy measure $\hat{\lambda}$
for the problem in (11) such that


$$h^i(\hat{\lambda}) - \varphi \ge 0 \;\; \forall i \in [I], \qquad \sum_{a \in \mathcal{A}} (\mathbf{I} - \gamma P_a^{\top}) \hat{\lambda}_a = (1-\gamma)\rho \qquad (10)$$


for some $0 < \varphi < 1$.

Remark 2. Assumption 3 is a stronger version of the popular Slater's condition, which
is often required in the analysis of convex optimization problems. A similar assumption is
considered in the literature as well (Mahdavi et al., 2012; Akhtar et al., 2021), and it helps
to ensure the boundedness of the dual variables (see Lemma 3).


The problem in (6) is well studied in the literature for linear objectives and constraints.
In this work, we consider concave utilities, and the aim is to develop an algorithm
that achieves zero constraint violation without sacrificing the objective optimality gap. To
do so, we consider the conservative stochastic optimization framework presented in (Mahdavi
et al., 2012; Akhtar et al., 2021) and utilize it to propose a conservative version of the
constrained problem with general utility function in (6) as


$$\max_{\lambda \ge 0} \; f(\lambda) \qquad (11a)$$
$$\text{s.t.} \quad h^i(\lambda) \ge \kappa \quad \forall i \in [I], \qquad (11b)$$
$$\sum_{a \in \mathcal{A}} (\mathbf{I} - \gamma P_a^{\top}) \lambda_a = (1-\gamma)\rho, \qquad (11c)$$


where $\kappa$ is the tuning parameter that controls how conservative the constraints are.
The idea is to consider a tighter version (controlled by $\kappa$) of the original inequality constraint
in (6), which allows us to achieve zero constraint violation for CMDPs, a guarantee that most
existing algorithms do not provide (cf. Table 1). Note that $\kappa$ and $\varphi$ are two different concepts: $\kappa$
is an artificially added parameter, while $\varphi$ is an intrinsic property of the original problem.
Moreover, by Assumption 3, it is natural to require $0 \le \kappa < \varphi < 1$, and we will specify
the value of the parameter $\kappa$ later in the convergence analysis section (cf. Sec. 5).
With Assumption 1, note that the conservative version of the problem in Eq. (11) is
still a convex program and hence strong duality holds under the Slater condition in
Assumption 3, which motivates us to develop a primal-dual algorithm to solve
the problem in (11). By the KKT theorem, the problem in Eq. (11) is equivalent to the
following saddle point problem, which we obtain by writing the Lagrangian of (11) as


$$L(\lambda, u, v) = f(\lambda) + \sum_{i \in [I]} u^i \big(h^i(\lambda) - \kappa\big) + (1-\gamma)\langle \rho, v\rangle + \sum_{a \in \mathcal{A}} \lambda_a^{\top} (\gamma P_a - \mathbf{I}) v$$
$$= f(\lambda) + \langle u, h(\lambda) - \kappa \mathbf{1}\rangle + (1-\gamma)\langle \rho, v\rangle + \sum_{a \in \mathcal{A}} \lambda_a^{\top} (\gamma P_a - \mathbf{I}) v, \qquad (12)$$

where $u := [u^1, u^2, \cdots, u^I]^{\top}$ is a column vector of the dual variables corresponding to the
constraints in (11b), $v$ is the dual variable corresponding to the equality constraint in (11c),
$h := [h^1, \cdots, h^I]$ collects all the $h^i$'s corresponding to the $I$ constraints in (11b), and $\mathbf{1}$ is the
all-ones column vector. From the Lagrangian in (12), the equivalent saddle point problem is
given by


$$\max_{\lambda \in \Lambda} \; \min_{u \ge 0,\, v} \; L(\lambda, u, v). \qquad (13)$$

Since the Lagrangian is concave w.r.t. the primal variable and convex w.r.t. the dual variables, it is
known that the saddle point problem can be solved by primal-dual gradient descent (Nedić &
Ozdaglar, 2009). However, since we assume that the transition dynamics $P_a$ are unknown,
directly evaluating the gradients of the Lagrangian in (13) with respect to the primal and dual
variables is not possible. To circumvent this issue, we resort to the randomized primal-dual
approach proposed in (Wang, 2020) to solve the problem in a model-free stochastic manner.
We assume the presence of a generative model, which is a common assumption in control/RL
applications. The generative model returns the next state $s'$ for a given state $s$ and action
$a$ and provides a reward $r(s,a)$ to train the policy. To this end, we consider a
distribution $\zeta$ over $\mathcal{S} \times \mathcal{A}$ to write a stochastic approximation of the Lagrangian $L(\lambda, u, v)$
in (13) as


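$$\hat{L}^{\zeta}_{(s,a,s'),s_0}(\lambda, u, v) = (1-\gamma) v(s_0) + \mathbb{1}_{\zeta(s,a)>0} \cdot \frac{\lambda(s,a)\,[\gamma v(s') - v(s) - M_1]}{\zeta(s,a)} + f(\lambda) + \langle u, h(\lambda) - \kappa \mathbf{1}\rangle - M_2, \qquad (14)$$


where $s_0 \sim \rho$, the current state-action pair $(s,a) \sim \zeta$, and the next state $s' \sim P(\cdot|s,a)$. We
remark that the stochastic approximation $\hat{L}^{\zeta}_{(s,a,s'),s_0}(\lambda, u, v)$ in (14) is an unbiased estimator
of the Lagrangian function in Eq. (12) up to the constants $M_1$ and $M_2$: for $\lambda \in \Lambda$ with
$\mathrm{supp}(\lambda) \subseteq \mathrm{supp}(\zeta)$, we have
$\mathbb{E}_{s_0 \sim \rho,\, (s,a) \sim \zeta,\, s' \sim P(\cdot|s,a)}\big[\hat{L}^{\zeta}_{(s,a,s'),s_0}(\lambda, u, v)\big] = L(\lambda, u, v) - M_1 - M_2$. We can
view $\zeta$ as an adaptive state-action pair distribution, which helps to control the variance of the
stochastic gradient estimator. The stochastic gradients of the Lagrangian with respect to the
primal and dual variables are given by


$$\hat{\nabla}_{\lambda} L(\lambda, u, v) = \mathbb{1}_{\zeta(s,a)>0} \cdot \frac{\gamma v(s') - v(s) - M_1}{\zeta(s,a)} \cdot E_{sa} + \nabla_{\lambda} f(\lambda) + \sum_{i \in [I]} u^i \nabla_{\lambda} h^i(\lambda) - M_2 \mathbf{1}, \qquad (15)$$

$$\hat{\nabla}_{u} L(\lambda, u, v) = h(\lambda) - \kappa \mathbf{1}, \qquad (16)$$

$$\hat{\nabla}_{v} L(\lambda, u, v) = e(s_0') + \mathbb{1}_{\zeta(s,a)>0} \cdot \frac{\lambda(s,a)\,\big(\gamma e(s') - e(s)\big)}{\zeta(s,a)}, \qquad (17)$$

where we define $e(s_0') = (1-\gamma) e(s_0)$, with $e(s_0) \in \mathbb{R}^{|\mathcal{S}|}$ being a column vector with all
entries equal to 0 except the $s_0$-th entry, which equals 1, and $E_{sa} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ is a matrix with only
the $(s,a)$ entry equal to 1 and all other entries equal to 0. We remark that $M_1$ and $M_2$ in
(15) are shift parameters that are used in the convergence analysis.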

Remark 3. We note that the special case presented in (Bai et al., 2022a) for CMDP uses
a similar primal-dual method, as follows:


$$\hat{L}^{\zeta}_{(s,a,s'),s_0}(\lambda, u, v) = (1-\gamma) v(s_0) + \mathbb{1}_{\zeta(s,a)>0} \cdot \frac{\lambda(s,a)\,(Z_{sa} - M)}{\zeta(s,a)}, \qquad (18)$$

where

$$Z_{sa} = r(s,a) + \gamma v(s') - v(s) + \sum_{i \in [I]} u^i g^i(s,a). \qquad (19)$$

However, the approximated Lagrangian defined in (14) is different from the above
equations. The above approach, when extended to general functions, leads to
a biased estimator of the gradient of the approximated Lagrangian due to the nonlinear functions
$f$ and $h$. This bias would make the analysis much more challenging, since the
analysis in Appendix C.8 and C.11 requires unbiasedness. Thus, in this paper, we redefine
the approximated Lagrangian so that we sample only for the transition function, and not
for the objective or constraints. The estimator in (15) is an unbiased estimator of
the gradient with respect to $\lambda$.


Remark 4. Note that although the proposed estimator has a bounded
second-order moment, the standard analysis of stochastic optimization leads to an
extra factor of order $\sqrt{|\mathcal{S}||\mathcal{A}|}$. This is because, for a given state-action pair $(s,a)$ with
$\zeta(s,a) > 0$,


$$\mathbb{E}\Big[\big(\hat{\nabla}_{\lambda} L(\lambda, u, v)(s,a)\big)^2\Big] = \mathbb{E}_{s,a,s'}\Big[\Big(\mathbb{1}_{\zeta(s,a)>0} \cdot \frac{\gamma v(s') - v(s) - M_1}{\zeta(s,a)} + \nabla_{\lambda} f(\lambda)(s,a) + \sum_{i \in [I]} u^i \nabla_{\lambda} h^i(\lambda)(s,a) - M_2\Big)^2\Big].$$

As for the first item,

$$\mathbb{E}_{s,a,s'}\Big[\Big(\mathbb{1}_{\zeta(s,a)>0} \cdot \frac{\gamma v(s') - v(s) - M_1}{\zeta(s,a)}\Big)^2\Big] \qquad (20)$$
$$= \mathbb{E}_{s'}\Big[\zeta(s,a) \Big(\frac{\gamma v(s') - v(s) - M_1}{\zeta(s,a)}\Big)^2\Big]$$
$$= \mathbb{E}_{s'}\Big[\frac{(\gamma v(s') - v(s) - M_1)^2}{(1-\delta)\lambda(s,a) + \frac{\delta}{|\mathcal{S}||\mathcal{A}|}}\Big]$$
$$\le \frac{4}{\delta} |\mathcal{S}||\mathcal{A}| M_1^2 =: \sigma^2, \qquad (21)$$

where the bound on the second moment depends on $|\mathcal{S}||\mathcal{A}|$. By the result
of the standard stochastic optimization analysis (Juditsky, Nemirovski, & Tauvel, 2011)[Corollary 1],
the convergence rate depends on $\sigma$, which finally leads to an extra factor
of order $\sqrt{|\mathcal{S}||\mathcal{A}|}$. To solve this problem, we use the KL divergence to regularize the occupancy
measure updates. With the KL divergence, we do not need to bound the second
moment, but only need to bound $\mathbb{E}\big[\sum_{s,a} \lambda(s,a) \Delta_{s,a}^2\big]$, where $\Delta_{s,a}$ is the $(s,a)$-th element of
$\hat{\nabla}_{\lambda} L(\lambda, u, v)$ (Lemma 6). Hence, unsampled (or less sampled) state-action pairs do
not contribute to the update. However, one still needs to ensure that the initial distribution
over state-action pairs supports all state-action pairs (Lemma 7 and Appendix C.6).


With all the stochastic gradient definitions in place, we are now ready to present the
proposed novel algorithm, called Conservative Stochastic Primal-Dual Algorithm (CSPDA) and
summarized in Algorithm 1.

Algorithm 1 Conservative Stochastic Primal-Dual Algorithm (CSPDA) for constrained RL

Input: Sample size $T$. Initial distribution $\rho$. Discount factor $\gamma$.

Parameter: Step-sizes $\alpha, \beta$. Slater variable $\varphi$, shift parameters $M_1, M_2$, conservative variable $\kappa$,
and constant $\delta \in (0, \frac{1}{2})$.

Output: $\bar{\lambda} = \frac{1}{T}\sum_{t=1}^{T} \lambda^t$, $\bar{u} = \frac{1}{T}\sum_{t=1}^{T} u^t$ and $\bar{v} = \frac{1}{T}\sum_{t=1}^{T} v^t$
1: Initialize $u^1 \in \mathcal{U}$, $v^1 \in \mathcal{V}$ and $\lambda^1 = \frac{1}{|\mathcal{S}||\mathcal{A}|} \mathbf{1}$
2: for $t = 1, 2, \ldots, T$ do
3:   $\zeta^t := (1-\delta)\lambda^t + \frac{\delta}{|\mathcal{S}||\mathcal{A}|} \mathbf{1}$
4:   Sample $(s_t, a_t) \sim \zeta^t$ and $s_0 \sim \rho$
5:   Sample $s_t' \sim P(\cdot|s_t, a_t)$ from the generative model and observe the reward $r_t$
6:   Update the dual variables $u$ and $v$ as
       $u^{t+1} = \Pi_{\mathcal{U}}\big(u^t - \alpha \hat{\nabla}_u L(\lambda^t, u^t, v^t)\big)$   (22)
       $v^{t+1} = \Pi_{\mathcal{V}}\big(v^t - \alpha \hat{\nabla}_v L(\lambda^t, u^t, v^t)\big)$   (23)
7:   Update the occupancy measure as
       $\lambda^{t+\frac{1}{2}} = \arg\max_{\lambda} \; \langle \hat{\nabla}_{\lambda} L(\lambda^t, u^t, v^t), \lambda - \lambda^t \rangle - \frac{1}{\beta} KL(\lambda \| \lambda^t)$   (24)
       $\lambda^{t+1} = \lambda^{t+\frac{1}{2}} / \|\lambda^{t+\frac{1}{2}}\|_1$   (25)
8: end for

First, we initialize the primal and dual variables in step 1. In
steps 4 and 5, we sample $(s_t, a_t, s_0)$ and then obtain $s_t'$ from the generative model. In step
6, we update the dual variables by a gradient descent step and a projection operation (see
Lemma 3 for the definitions of $\mathcal{U}$ and $\mathcal{V}$). In step 7, we utilize a mirror ascent update
with the KL divergence as the Bregman divergence to obtain tight dependencies in
the convergence rate analysis, similar to (Wang, 2020). Then, the occupancy measure is
normalized so that it remains a valid distribution.
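
For intuition about step 7, a mirror ascent step with the KL divergence followed by normalization has the familiar multiplicative-weights closed form: the new occupancy measure is proportional to $\lambda^t(s,a)\exp\big(\beta\, g(s,a)\big)$, where $g$ is the stochastic gradient with respect to $\lambda$. The Python sketch below implements this closed form under that reading of the update; the function name and the example gradient are hypothetical.

```python
import math

def kl_mirror_ascent_step(lam, grad, beta):
    """KL mirror ascent followed by L1 normalization:
    lam_new(s,a) is proportional to lam(s,a) * exp(beta * grad(s,a))."""
    # Subtracting the largest gradient entry avoids overflow in exp();
    # the common factor cancels in the normalization.
    g_max = max(max(row) for row in grad)
    raw = [[l * math.exp(beta * (g - g_max)) for l, g in zip(lrow, grow)]
           for lrow, grow in zip(lam, grad)]
    total = sum(sum(row) for row in raw)
    return [[x / total for x in row] for row in raw]

# Hypothetical example: uniform start over 2 states and 2 actions.
lam = [[0.25, 0.25], [0.25, 0.25]]
grad = [[1.0, 0.0], [0.0, -1.0]]
new_lam = kl_mirror_ascent_step(lam, grad, beta=0.5)
for row in new_lam:
    print(row)
```

Because the update is multiplicative, an entry of $\lambda^t$ that equals zero stays zero in every later iterate, which is consistent with the requirement that the initial distribution support all state-action pairs.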


5. Convergence Analysis 


In this section, we study the convergence rate of the proposed Algorithm 1 in detail. We
start by analyzing the duality gap for the saddle point problem in (13). Then we show that
the output of Algorithm 1, given by $\bar{\lambda}$, is $\epsilon$-optimal for the conservative version of the dual
domain optimization problem in (11) of CMDPs. Finally, we perform the analysis in the
policy space and present the main results of this work. We prove that the policy $\bar{\pi}$ induced
by the occupancy measure $\bar{\lambda}$ is also $\epsilon$-optimal and achieves zero constraint violation
at the same time.


5.1 Convergence Analysis for Duality Gap 


In order to bound the duality gap, we note that the standard analysis of saddle point
algorithms (Nedić & Ozdaglar, 2009; Akhtar et al., 2021) is not applicable because of the
unbounded noise introduced into the updates by the adaptive sampling of the
state-action pairs (Wang, 2020; Zhang et al., 2021). Therefore, it becomes necessary to
obtain explicit bounds on the gradient as well as the variance of the stochastic estimates
of the gradients. Define $(\lambda_{\kappa}^*, u_{\kappa}^*, v_{\kappa}^*)$ as the solution of the saddle point problem in Eq. (13).
Notice that the optimal primal and dual variables are functions of the conservative variable
$\kappa$. When $\kappa = 0$, which means we are considering the original problem in Eq. (6), we omit
the subscript $\kappa$ and denote the optimal primal and dual variables as $(\lambda^*, u^*, v^*)$. We start the
analysis by considering the form of Slater's condition in Assumption 3, and show that the dual
variables $u$ and $v$ are bounded.


Lemma 3 (Bounded dual variable u and v). Under Assumption 3, the optimal dual
variables ux and v* are bounded. Formally, it holds that ||ut||1 < a and ||vEllo < 
Ly . Abyln 
I-y | (l= )¢" 


The proof of Lemma 3 is provided in Appendix, C.1. As a result, we define U := 
{u| luli < 2} and v= {v | IIvllo < 2-4 + ESS}. 
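Since we have mathematically defined the sets $\mathcal{U}$ and $\mathcal{V}$, we now rewrite the saddle point
formulation in (13) as


$$\max_{\lambda \in \Lambda} \; \min_{u \in \mathcal{U},\, v \in \mathcal{V}} \; L(\lambda, u, v). \qquad (26)$$

In the analysis presented next, we will work with the problem in (26). First, we decompose
the duality gap in Lemma 4 as follows.

Lemma 4 (Duality gap). For any dual variables $u, v$, let us define $w = [u^{\top}, v^{\top}]^{\top}$ and
consider $\bar{u}, \bar{v}, \bar{\lambda}$ as defined in Algorithm 1. The duality gap can be bounded as


$$L(\bar{u}, \bar{v}, \lambda_{\kappa}^*) - L(u, v, \bar{\lambda}) \le \frac{1}{T} \sum_{t=1}^{T} \Big[ \underbrace{\langle \nabla_{\lambda} L(w^t, \lambda^t), \lambda_{\kappa}^* - \lambda^t \rangle}_{(I)} + \underbrace{\langle \nabla_{w} L(w^t, \lambda^t), w^t - w \rangle}_{(II)} \Big]. \qquad (27)$$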

The bounds on terms (I) and (II) in the statement of Lemma 4 are provided in Lemmas
6 and 7 in Appendix C.3 (see proofs in Appendix C.4 and C.5, respectively). These help
to prove the main result in Theorem 1, which establishes the final bound on the duality gap
as follows.


Theorem 1. Define $(u^{\dagger}, v^{\dagger}) := \arg\min_{u,v} L(u, v, \bar{\lambda})$. Recall that $\lambda_{\kappa}^*$ is the optimal solution of the
conservative Lagrangian problem. The duality gap of Algorithm 1 is bounded as


(28) 


RIC(a, #,A*) — C(ut, vt, X)] < o( SK loa( SAD) . pats 


The proof of Theorem 1 is provided in Appendix C.3. The result in Theorem 1 describes
a sub-linear dependence of the duality gap on the state-action space cardinality, up to a
logarithmic factor. In the next subsection, we utilize the duality gap upper bound to derive
bounds on the objective suboptimality and the constraint violation separately.

5.2 Dual Objective and Constraint Violation


Recall that the saddle point problem in Eq. (26) is equivalent to the problem in Eq. (6), where
the main difference arises from the newly introduced conservativeness parameter $\kappa$. Thus,
a convergence analysis for the duality gap should imply convergence in the occupancy measure
in Eq. (11). But before that, we need to characterize the gap between the original problem
(6) and its conservative version in (11). The following Lemma 5 shows that the gap is of
the order of the parameter $\kappa$.


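Lemma 5. Under Assumption 3 and the condition $\kappa \le \min\{\frac{\varphi}{2}, 1\}$, the difference
of the optimal values between the original problem and the conservative problem is $O(\kappa)$.
Mathematically, it holds that $\langle \lambda^*, r \rangle - \langle \lambda_{\kappa}^*, r \rangle \le \frac{\kappa}{\varphi}$.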


The proof of Lemma 5 is provided in Appendix D.1. Using the statement of Lemma 5
and Theorem 1, we obtain the convergence result in terms of the output occupancy measure in
the following Theorem 2.


Theorem 2. For any0 <e¢ <1, there exists a constant C, such that if 


1 L4L;,1\S\|A| log(|S|| Al) 
2 ~2 ~f 7h 
T > max {16,4 : >} er (i —73y2 ; (29) 
and we set 
QL Lpe L 4D ¢L 8L,L 
= 2ebats, STAT ORCSTAD, yy, = | ae 
se - beg See y 
then the constraints of the original problem in (6) satisfy: 
E[A(A)| > ep Wie [I], (30a) 
- 1— 
E|| SSOPT -DAa+(1- al] < et (30b) 
1 LyslLp 


a 


Additionally, the objective sub-optimality of (6) is given by 

$$\mathbb{E}[f(\lambda^*) - f(\bar{\lambda})] \le 3\epsilon. \tag{31}$$

The proof of Theorem 2 is provided in Appendix D.2. Next, we present the special case of Theorem 2 in the form of Corollary 1 (see proof in Appendix D.3), which shows the equivalent results for the case without the conservation parameter, i.e., $\kappa = 0$. 


Corollary 1 (Non Zero-Violation Case). Set $\kappa = 0$. For any $\epsilon > 0$, there exists a constant 


9 LFLZI|S||Allog(\S|IAl) 


c such that if T > c then X satisfies the constraint violation as 


C= yere 
E[A'(A)] >—e Wie [I] (32a) 
g|| OPT —DAa + (1—r)ol]. < a (32b) 


a 


and the sub-optimality is given by E[f(A*) — f(A)] < . 


The positive lower bound in (30a) indicates that $\bar{\lambda}$ is feasible (hence zero constraint violation). On the other hand, the lower bound in (32a) is the negative value $-\epsilon$, which means that the constraints in the dual space may not be satisfied by $\bar{\lambda}$. Next, we show how the result in Theorem 2 helps to achieve zero constraint violation in the policy space. 

5.3 Convergence Analysis in Policy Space 


We have established convergence in the occupancy measure space in Sec. 5.2 and shown that $\bar{\lambda}$ is an $\epsilon$-optimal, $\epsilon$-feasible solution, but the claim of zero constraint violation is still not clear. Even a small violation in Eq. (30b) makes $\bar{\lambda}$ lose its physical meaning, as discussed in Proposition 1 of (Zhang et al., 2021). Thus, to make the idea clearer and to explicitly show the benefit of the conservative idea utilized in this work, we further present the results in the policy space. The bound in Eq. (30b) provides the intuition that the output occupancy measure is close to the optimal one and, therefore, the induced policy should also be close to the optimal policy. This result is stated formally in Theorem 3. 


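Theorem 3 (Zero-Violation). Under the condition in Theorem 2, the policy $\bar{\pi}$ induced by the output occupancy measure $\bar{\lambda}$ is an $\epsilon$-optimal policy and achieves zero constraint violation. Mathematically, this implies that 

$$f(\lambda^{\pi^*}) - \mathbb{E}[f(\lambda^{\bar{\pi}})] \le \epsilon, \tag{33a}$$
$$\mathbb{E}[h^i(\lambda^{\bar{\pi}})] \ge 0 \quad \forall i \in [I]. \tag{33b}$$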


The proof of Theorem 3 is provided in Appendix E.1. To better illustrate the importance of the result in Theorem 3, we next present Corollary 2 (see proof in Appendix E.2), which is a special case of Theorem 3 for $\kappa = 0$. 


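Corollary 2 (Non Zero-Violation Case). Under the condition in Corollary 1, the policy $\bar{\pi}$ induced by the output occupancy measure $\bar{\lambda}$ is an $\epsilon$-optimal policy w.r.t. both objective and constraints. More formally, 

$$f(\lambda^{\pi^*}) - \mathbb{E}[f(\lambda^{\bar{\pi}})] \le \epsilon, \tag{34a}$$
$$\mathbb{E}[h^i(\lambda^{\bar{\pi}})] \ge -\epsilon \quad \forall i \in [I]. \tag{34b}$$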


The benefit of utilizing the conservation parameter $\kappa$ becomes clear after comparing the results in (33b) and (34b). 


6. Empirical Evaluations 


In this section, we evaluate the proposed CSPDA algorithm on three different environments. For the first environment, we construct a random MDP, as considered by (Liu et al., 2021b). The second environment is a gridworld in which the agent needs to cross a border to reach a goal state; the fastest route is unsafe, and there exists another route which is safe but longer (Paternain et al., 2019). The third environment is a discrete-time queuing system with a single server (Altman, 1999, Chapter 5), as considered in (Agarwal et al., 2022c; Gattami et al., 2021). We now provide the experimental details and simulation results separately for each of the environments. 


6.1 Random MDPs 


The random MDP has 100 states with transition probabilities sampled from a Dirichlet distribution. The rewards $r(s,a)$ are sampled from a uniform distribution over $[0,1)$ and the costs $c(s,a)$ are sampled from a uniform distribution over $[-0.5,0.5)$. The goal of the agent is to maximize the average reward $\lambda^\top r$ while ensuring that the average cost $\lambda^\top c$ is at least 0. 

Figure 1: Learning process of the proposed algorithm for linear objective and constraint value with $\kappa = 0$ and $\kappa > 0$. The total reward is the objective in (37) with $c = 0$ and the constraint value is the L.H.S. of the constraint in (37). 
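The construction of the random MDP can be sketched as follows. This is only an illustrative sketch, not the authors' experimental code: the function name `sample_random_mdp`, the dictionary layout, and the Dirichlet concentration parameter (assumed to be 1 here, since the text does not state it) are assumptions.

```python
import random

def sample_random_mdp(num_states, num_actions, seed=0, concentration=1.0):
    """Sample a tabular MDP: Dirichlet transitions, uniform rewards and costs."""
    rng = random.Random(seed)
    P, r, c = {}, {}, {}
    for s in range(num_states):
        for a in range(num_actions):
            # A Dirichlet sample is a vector of Gamma draws normalised to sum to 1.
            # The concentration value is an assumption; the text does not give it.
            gammas = [rng.gammavariate(concentration, 1.0) for _ in range(num_states)]
            total = sum(gammas)
            P[(s, a)] = [g / total for g in gammas]
            r[(s, a)] = rng.random()          # uniform over [0, 1)
            c[(s, a)] = rng.random() - 0.5    # uniform over [-0.5, 0.5)
    return P, r, c

# Sizes used in this subsection: 100 states and 4 actions.
P, r, c = sample_random_mdp(num_states=100, num_actions=4)
```

Each entry `P[(s, a)]` is a probability vector over next states, so a single sampled MDP can be reused across independent runs by fixing the seed.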

We sample a single MDP and perform 100 independent runs of the CSPDA algorithm on the sampled MDP. For this example, we choose $T = 10000$, and the step sizes $\alpha$ and $\beta$ are set in accordance with Section C.3 as: 


— _ LyLny/\S| 
> algal a 


$$\beta = \frac{(1-\gamma)\varphi}{L_f L_h}\sqrt{\frac{\log(|S||A|)}{T|S||A|}} \tag{36}$$


with $|S| = 100$ and $|A| = 4$, and with $L_f = L_h = 1$ since we consider a linear setup in which the maximum reward and cost are bounded by 1. Further, since we have only one cost function, we have $I = 1$. Finally, we set $\varphi = 0.48$ and $\delta = 0.01$. 

We present the simulation results for the proposed CSPDA algorithm on the random MDP in Figure 2. We note that the choice of $\kappa$ plays a significant role in the performance of the algorithm. The objective value (the average reward) is higher, but not significantly, for $\kappa = 0$. However, when comparing the average cost values, the implementation with $\kappa > 0$ performs significantly better, which shows the role that tuning $\kappa$ can play in the performance of the learnt policy. 


6.2 Gridworld Environment 


We next evaluate the proposed algorithm on a 15 x 15 gridworld environment. The agent starts from a fixed position on the map and can move in 4 directions if permitted. The agent aims to cross the room and reach the goal state as soon as possible to obtain some reward. The map of the gridworld is presented in Figure 3. The agent does not receive any reward until it reaches the goal state. After the agent reaches the goal state, it stays in that state and receives a reward of 1 unit at every subsequent step. The room has a wall in the middle which separates the starting cell and the goal state. The wall has two openings, one of which is restricted. If the agent uses the restricted cell, it can reach the goal faster; however, it then receives a penalty in the form of a cost of -1. We aim to prevent the agent from passing through the restricted cell, and thus, the average cost should be non-negative. 

Figure 2: Learning process of the proposed algorithm for objective and constraint value with $\kappa = 0$ and $\kappa > 0$ evaluated on random MDP with 100 states and 4 actions. 

[Figure 3 legend: goal cell, boundary, restricted cell, allowed cell.] 

Figure 3: Map of the gridworld environment. The agent has to reach the goal state to obtain a reward of 1 unit. If the agent crosses the red cell, it incurs a cost of -1 unit, whereas the agent can cross the green cell without incurring any cost. 




ACHIEVING ZERO CONSTRAINT VIOLATION FOR CURL VIA PRIMAL-DUAL APPROACH 


[Figure 4 plots: objective value (left panel) and constraint (cost) value (right panel) versus iteration t, with curves for kappa = 0 and kappa > 0.] 

Figure 4: Learning process of the proposed algorithm for objective and constraint value with $\kappa = 0$ and $\kappa > 0$ evaluated on random MDP with 225 states and 4 actions. 


We again perform 100 independent runs of the CSPDA algorithm on the sampled MDP. We present the simulation results for the proposed CSPDA algorithm on the gridworld 
in Figure 4. We set $\beta = 0.0001$ and $\alpha = 10$. We again note that the choice of $\kappa$ plays a significant role in the performance of the algorithm. The objective value (the average reward) is higher, but not significantly, for $\kappa = 0$. However, when comparing the average cost values, the implementation with $\kappa > 0$ performs significantly better, which shows the role that tuning $\kappa$ can play in the performance of the learnt policy. 


6.3 Evaluations on a Queuing System 


In this section, we evaluate the proposed Algorithm 1 on a queuing system with a single queue. In this model, we assume a buffer of finite size $L$. A possible arrival is assumed to occur at the beginning of the time slot. The state of the system is the number of customers waiting in the queue at the beginning of the time slot, so that the size of the state space is $|S| = L + 1$. We assume that there are two kinds of actions: service action and flow action. The service action is selected from a finite subset $A$ of $[a_{\min}, a_{\max}]$ such that $0 < a_{\min} < a_{\max} < 1$. With a service action $a$, we assume that the service of a customer is successfully completed with probability $a$. If the service succeeds, the length of the queue reduces by one; otherwise, the queue length remains the same. The flow action is selected from a finite subset $B$ of $[b_{\min}, b_{\max}]$ such that $0 < b_{\min} < b_{\max} < 1$. Given a flow action $b$, a customer arrives with probability $b$. Let the state at time $t$ be $x_t$; we assume that no customer arrives when $x_t = L$. Finally, the overall action space is the product of the service action space and the flow action space, i.e., $A \times B$. Given an action pair $(a,b)$ and current state $x_t$, the transition probability $P(x_{t+1} \mid x_t, a_t = a, b_t = b)$ of this system is shown in Table 2. 
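The transition kernel of Table 2 can be written as a small function. The sketch below is illustrative only; the function name `queue_transition` and the dictionary return type are assumptions made for this example.

```python
def queue_transition(x, a, b, L):
    """Return {next_state: probability} for queue length x under actions (a, b)."""
    if not 0 <= x <= L:
        raise ValueError("queue length must lie in [0, L]")
    if x == 0:
        # Row x_t = 0 of Table 2: the queue grows with probability b(1 - a).
        return {0: 1 - b * (1 - a), 1: b * (1 - a)}
    if x == L:
        # Row x_t = L: no arrivals, so only a completed service changes the state.
        return {L - 1: a, L: 1 - a}
    # Interior rows 1 <= x_t <= L - 1.
    return {
        x - 1: a * (1 - b),
        x: a * b + (1 - a) * (1 - b),
        x + 1: (1 - a) * b,
    }

# Example: buffer size L = 5, service action a = 0.4, flow action b = 0.6.
dist = queue_transition(2, 0.4, 0.6, 5)
```

Every row of Table 2 sums to one, which is a quick sanity check for any implementation of this kernel.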

Current state      | P(x_{t+1} = x_t - 1) | P(x_{t+1} = x_t)    | P(x_{t+1} = x_t + 1) 
1 <= x_t <= L - 1  | a(1 - b)             | ab + (1 - a)(1 - b) | (1 - a)b 
x_t = L            | a                    | 1 - a               | 0 
x_t = 0            | 0                    | 1 - b(1 - a)        | b(1 - a) 

Table 2: Transition probability of the queue system 

Figure 5: Learning process of the proposed algorithm for concave objective and constraint value with $\kappa = 0$ and $\kappa > 0$. The total reward is the objective in (37) with $c = 1$ and the constraint value is the L.H.S. of the constraint in (37). 

Assuming $\gamma = 0.5$, we define the objective function $f$ as the total discounted cumulative reward plus entropy regularization, and we define the two constraint functions $h^1, h^2$ as the standard total discounted constraint values with respect to service and flow. Thus, the overall optimization problem is given as 


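$$\max_{\lambda}\ \langle \lambda, r\rangle - c\sum_{s,a}\lambda_{sa}\log(\lambda_{sa}) \quad \text{s.t.}\quad \langle\lambda, g^i\rangle \ge 0,\ \ i = 1, 2, \tag{37}$$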


where $s_0 \sim \rho$, and $\pi^a$ and $\pi^b$ are the policies for the service and flow, respectively. It is not hard to verify that the above objective function is concave and Lipschitz. For simulations, we choose $L = 5$, $A = [0.2, 0.4, 0.6, 0.8]$, and $B = [0.4, 0.5, 0.6, 0.7]$ for all states besides the state $s = L$. Further, we select the Slater variable $\varphi = 0.2$, the number of iterations $T = 100000$, $\epsilon = 0.02$, and the conservative variable $\kappa$ as in the statement of Theorem 2. The initial distribution $\rho$ is set to the uniform distribution. Moreover, the cost function is set to $r(s,a,b) = -s+5$, the constraint function for the service is defined as $g^1(s,a,b) = -10a+4$, and the constraint function for the flow is $g^2(s,a,b) = -8(1-b)^2 + 1.28$. We run 100 independent simulations and collect the mean value and standard variance. In Fig. 1 and Fig. 5, we set $c = 0$ and $c = 1$, which correspond to the standard CMDP problem and the concave utility problem, respectively. In each figure, we show the learning process of the objective value and the constraint value for $\kappa = 0$ and $\kappa > 0$ (in the case of $\kappa > 0$, the value is chosen based on Theorem 2). Note that the y-axis in Figs. 1 and 5 is the objective function (on the left) and the constraint function (on the right) defined in Eq. (37). In both cases, it can be seen that when $\kappa = 0$, the constraint values converge to a small negative number as $T$ grows, while for $\kappa > 0$, the constraint values converge to a positive value, which matches the theoretical result. Further, the objective values are similar for both $\kappa = 0$ and $\kappa > 0$, while the case $\kappa > 0$ helps to achieve zero constraint violation. Having $\kappa$ as a hyperparameter in practice can lead to optimal objectives where the constraint violations converge to zero. 
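To make the simulation setup concrete, the per-step functions above can be written out directly. This is a minimal sketch: the function and variable names are assumptions, and the check below only inspects the sign of the per-step constraint values, not the discounted cumulative constraints that the CMDP actually imposes.

```python
def reward(s, a, b):
    """Per-step objective signal r(s, a, b) = -s + 5 from the text."""
    return -s + 5

def g_service(s, a, b):
    """Service constraint function g^1(s, a, b) = -10a + 4."""
    return -10 * a + 4

def g_flow(s, a, b):
    """Flow constraint function g^2(s, a, b) = -8(1 - b)^2 + 1.28."""
    return -8 * (1 - b) ** 2 + 1.28

SERVICE_ACTIONS = [0.2, 0.4, 0.6, 0.8]
FLOW_ACTIONS = [0.4, 0.5, 0.6, 0.7]

# Actions whose per-step constraint value is non-negative.
# TOL absorbs floating-point rounding at the boundary values a = 0.4 and b = 0.6.
TOL = 1e-9
safe_service = [a for a in SERVICE_ACTIONS if g_service(0, a, 0.5) >= -TOL]
safe_flow = [b for b in FLOW_ACTIONS if g_flow(0, 0.2, b) >= -TOL]
```

Under these definitions, service actions up to 0.4 and flow actions from 0.6 upward have non-negative per-step constraint values, so the constraints push the policy toward slower service and higher arrival rates, in tension with the reward $-s+5$ that favors short queues.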


7. Conclusion 


In this work, we considered the problem of learning optimal policies for infinite-horizon concave constrained Markov Decision Processes (CCMDP) with finite state space $S$ and finite action space $A$ under $I$ constraints. Such constrained reinforcement learning (CRL) with concave utility has not been studied in the literature. To solve the problem in a model-free manner, we proposed a novel Conservative Stochastic Primal-Dual Algorithm (CSPDA) based upon the randomized primal-dual saddle point approach proposed in (Wang, 2020). 
We show that to achieve an e-optimal policy, it is sufficient to run the proposed Algorithm 


1 for eres ae steps. Additionally, we proved that the proposed Algorithm 1 


does not violate any of the $I$ constraints, which is unique to this work in the CRL literature. The idea is to consider a conservative version (controlled by the parameter $\kappa$) of the original constraints; a suitable choice of $\kappa$ then makes the constraint violation zero while still achieving the best sample complexity for the objective suboptimality. 


We note that while results in the parametrized setup have been studied for concave utility without constraints (Bai et al., 2022a), for linear utility without constraints (Mondal & Aggarwal, 2023), and for linear utility with constraints (Bai, Bedi, & Aggarwal, 2023), the corresponding results with concave utility and constraints remain open. 

Appendix A. Preliminaries 
A.1 Explanation of Comparison among References in Table 1 


STEP 1: FROM REGRET TO PAC RESULT 


Many references listed in Table 1 are in the episodic setting and give their result in the form of regret, which is bounded as 

$$\sum_{k=1}^{K}\big[V_1^{*}(s_1) - V_1^{\pi_k}(s_1)\big] \le f(H, |S|, |A|, T, \delta) \quad \text{with probability at least } 1-\delta, \tag{38}$$

where $T = KH$. The following method provides a probably approximately correct (PAC) result from the regret. At the end of the learning horizon $K$, a policy $\bar{\pi}$ can be defined as follows: 

$$\bar{\pi}(s) = \begin{cases}\pi_1(s) & \text{with probability } 1/K\\ \ \vdots & \\ \pi_K(s) & \text{with probability } 1/K\end{cases} \tag{39}$$

Note that $\bar{\pi}$ chooses among the policies $\pi_k$, $k \in [K]$, uniformly at random. Thus, we know $\frac{1}{K}\sum_{k=1}^{K} V_1^{\pi_k}(s_1) = V_1^{\bar{\pi}}(s_1)$. Dividing Eq. (38) by $K$ on both sides, we have 

$$V_1^{*}(s_1) - V_1^{\bar{\pi}}(s_1) \le \frac{f(H, |S|, |A|, T, \delta)}{K}. \tag{40}$$

If the function $f$ is sub-linear w.r.t. $T$, then for large enough $K$, we have $V_1^{*}(s_1) - V_1^{\bar{\pi}}(s_1) \le \epsilon$ with probability at least $1 - \delta$, which means that $\bar{\pi}$ is an $\epsilon$-optimal policy. 
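The regret-to-PAC step rests on one identity: the value of the uniform mixture policy is the average of the per-episode values, so its optimality gap equals the cumulative regret divided by $K$. A minimal numeric sketch follows; the helper names and the per-episode values are hypothetical and chosen only for illustration.

```python
def mixture_value(values):
    """Value of the policy that picks pi_k uniformly at random: the mean of the V^{pi_k}."""
    return sum(values) / len(values)

def pac_gap(regret, num_episodes):
    """Optimality gap of the mixture policy implied by a cumulative regret, as in Eq. (40)."""
    return regret / num_episodes

# Hypothetical per-episode values V_1^{pi_k}(s_1) and optimal value V_1^*(s_1).
values = [0.2, 0.5, 0.7, 0.8]
v_star = 1.0
regret = sum(v_star - v for v in values)   # left-hand side of Eq. (38)
gap = v_star - mixture_value(values)       # left-hand side of Eq. (40)
```

Here `gap` coincides with `pac_gap(regret, len(values))`, so any upper bound on the regret translates into an upper bound on the gap after division by $K$.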


STEP 2: FROM EPISODIC SETTING TO INFINITE HORIZON DISCOUNTED SETTING 


As mentioned above, many references consider the problem in the episodic setting. In order to make a comparison, it is necessary to have a fair conversion. Here, we use the method from (Jin et al., 2018)[footnote 3 on page 3]. First, we check whether the MDP model in the given result assumes horizon-dependent transition dynamics, i.e., whether $P$ is a function of $h$. If so, we define $S' = SH$. If not, we define $S' = S$. This conversion is reasonable because a state space that is $H$ times larger is needed if the transition dynamics are different for each $h$. After this step, we change $H$ to $\frac{1}{1-\gamma}$. This is because the infinite horizon discounted value function can be simulated by the following algorithm. 


Algorithm 2 Unbiased estimator for Value Function 
Input: Initial distribution $\rho$. Discount factor $\gamma$. Policy $\pi$ 
Output: Value function $V_\rho^\pi$ 

1: Sample $s_1 \sim \rho$, $H \sim \mathrm{Geo}(1-\gamma)$ 
2: for each state $s_1$ in $S$ do 
3:   for $h = 1, 2, \ldots, H$ do 
4:     Take action $a_h \sim \pi(\cdot|s_h)$, observe next state $s_{h+1}$ and reward $r(s_h, a_h)$ 
5:   end for 
6:   $V^\pi(s_1) = \sum_{h=1}^{H} r(s_h, a_h)$ 
7: end for 
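The idea behind Algorithm 2 can be illustrated with a short simulation. The sketch below is not the authors' implementation: the callback interface (`reset`, `policy`, `step`) is an assumption, and the toy environment has a single state with reward 1 per step so that the true discounted value is $1/(1-\gamma)$. The horizon is drawn so that $P(H \ge h) = \gamma^{h-1}$, which makes the undiscounted sum over $H$ steps an unbiased estimate of the discounted return.

```python
import random

def sample_geometric(rng, gamma):
    """Draw H >= 1 with P(H >= h) = gamma ** (h - 1), i.e. H ~ Geo(1 - gamma)."""
    horizon = 1
    while rng.random() < gamma:
        horizon += 1
    return horizon

def rollout_estimate(rng, gamma, reset, policy, step):
    """One sample of the undiscounted return over a geometrically distributed horizon."""
    state = reset(rng)
    total = 0.0
    for _ in range(sample_geometric(rng, gamma)):
        action = policy(rng, state)
        state, reward = step(rng, state, action)
        total += reward
    return total

# Toy check (hypothetical environment): one state, one action, reward 1 per step.
rng = random.Random(0)
gamma = 0.5
samples = [
    rollout_estimate(
        rng, gamma,
        reset=lambda g: 0,
        policy=lambda g, s: 0,
        step=lambda g, s, a: (s, 1.0),
    )
    for _ in range(20000)
]
estimate = sum(samples) / len(samples)   # should be close to 1 / (1 - gamma) = 2
```

Averaging many such rollouts approximates the discounted value, while each rollout uses $1/(1-\gamma)$ samples in expectation.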


The sampled horizon is drawn from the geometric distribution with parameter $(1-\gamma)$, and thus the expected length of the horizon is $\frac{1}{1-\gamma}$, which explains why it is fair to change $H$ to $\frac{1}{1-\gamma}$. Following these two steps, we convert the result in the episodic setting into the infinite horizon discounted setting. 

STEP 3: FROM HIGH PROBABILITY RESULT TO EXPECTATION RESULT 


After converting the result from the episodic setting to the infinite horizon discounted setting, we get an $\epsilon$-optimal result with probability at least $1 - \delta$. However, the result in this paper is in the form of expectation. Thus, we can convert the result with the following method. Noticing that the value function is bounded by $\frac{1}{1-\gamma}$, we have 

$$\mathbb{E}[V^{*}(s_1) - V^{\bar{\pi}}(s_1)] \le \epsilon \times (1-\delta) + \delta \times \frac{1}{1-\gamma}. \tag{41}$$

If $\delta \le \epsilon(1-\gamma)$, then we have $\mathbb{E}[V^{*}(s_1) - V^{\bar{\pi}}(s_1)] \le 2\epsilon$. 
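The arithmetic in Eq. (41) can be checked numerically. In the sketch below, the function name and the values of $\epsilon$ and $\gamma$ are assumptions chosen only for illustration.

```python
def expectation_gap_bound(epsilon, delta, gamma):
    """Right-hand side of Eq. (41): epsilon * (1 - delta) + delta / (1 - gamma)."""
    return epsilon * (1 - delta) + delta / (1 - gamma)

epsilon, gamma = 0.02, 0.5
delta = epsilon * (1 - gamma)   # largest delta allowed by the condition delta <= epsilon(1 - gamma)
bound = expectation_gap_bound(epsilon, delta, gamma)
```

With this choice of $\delta$, the failure event contributes at most $\epsilon$ and the success event contributes less than $\epsilon$, so the expected gap is at most $2\epsilon$.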


AN EXAMPLE FOR UC-CFH IN (KALAGARLA ET AL., 2021) 


In the UC-CFH algorithm, the author proposed an ¢€-optimal result with at most oe log(+)) 
episodes, where C’ is the upper bound on the number of possible successor states for a state- 
action pair. Thus, C < |S| and the above equation can be bounded by jes log()). 
Notice that this is already a PAC result and we begin converting it into infinite horizon 
discounted setting. 

e Firstly, we know kK = o( Ghee 


~ 3 3 
Ki oj log(s)). Notice that UC-CFH algorithm doesn’t assume horizon 
dependent transition dynamics (They assume in we model, however, not in the al- 
gorithm and theorem). Thus, by changing H to — we have sample complexity 


~ 3 
O(/SCL log($)). 


log(+)) and thus the total sample complexity is 


e Secondly, change 6 to e(1—7), we get the sample complexity in the form of expectation, 


~ 3 
which means with O( Soe) sample, we have 


E[Vi"(s1) — Yy"*(s1)] Se (42) 


Appendix B. Notations 

For the purpose of analysis in the appendix, we use the shorthand notation $\lambda_{sa}$ for $\lambda(s,a)$. 

Appendix C. Proofs for Section 5.1 


C.1 Proof of Lemma 3 


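Proof. Bound on $\|u_\kappa^*\|_1$: Let us denote the optimal value of the optimization problem in (11) as $p_\kappa^*$ and write the corresponding dual problem as 

$$D_\kappa(u, v) = \max_{\lambda} \mathcal{L}(\lambda, u, v) = \max_{\lambda}\ f(\lambda) + \sum_{i\in[I]} u^i\big(h^i(\lambda) - \kappa\big) + (1-\gamma)\langle\rho, v\rangle + \sum_{a\in A}\lambda_a^\top(\gamma P_a - I)v. \tag{43}$$

The optimal dual variables are given by 

$$(u_\kappa^*, v_\kappa^*) := \arg\min_{u \ge 0,\, v} D_\kappa(u, v), \tag{44}$$

and let us denote the optimal dual value by $d_\kappa^* = D_\kappa(u_\kappa^*, v_\kappa^*)$. We note that the problem in (11) is a convex programming problem. By the Slater condition in Assumption 3, we know strong duality holds, i.e., $p_\kappa^* = d_\kappa^*$. To proceed, let us consider a constant $C$ and define a set $\mathcal{C} := \{(u,v) \ge 0 \mid D_\kappa(u,v) \le C\}$. For any $(u,v) \in \mathcal{C}$ and a feasible $\tilde{\lambda}$ which satisfies Assumption 3, we could write 

$$\begin{aligned} C \ge D_\kappa(u,v) &\overset{(a)}{\ge} \mathcal{L}(\tilde{\lambda}, u, v)\\ &= f(\tilde{\lambda}) + \sum_{i\in[I]} u^i\big(h^i(\tilde{\lambda}) - \kappa\big) + (1-\gamma)\langle\rho, v\rangle + \sum_{a\in A}\tilde{\lambda}_a^\top(\gamma P_a - I)v\\ &\overset{(b)}{\ge} f(\tilde{\lambda}) + (\varphi - \kappa)\|u\|_1\\ &\ge f(\tilde{\lambda}) + \frac{\varphi}{2}\|u\|_1, \end{aligned} \tag{45}$$

where step (a) holds by the definition of the dual function and step (b) is true by Assumption 3 and $\kappa \le \varphi/2$. From weak duality, we have 

$$D_\kappa(u,v) \ge d_\kappa^* \ge p_\kappa^* = \langle\lambda_\kappa^*, r\rangle. \tag{46}$$

Now let $C = \langle\lambda_\kappa^*, r\rangle$; all inequalities in Eq. (46) become equalities for $(u,v) \in \{(u,v) \ge 0 \mid D_\kappa(u,v) \le \langle\lambda_\kappa^*, r\rangle\}$. Thus, this set is the optimal dual variable set. We set $C = \langle\lambda_\kappa^*, r\rangle$ and rearrange Eq. (45) to obtain 

$$\|u_\kappa^*\|_1 \le \frac{2[f(\lambda_\kappa^*) - f(\tilde{\lambda})]}{\varphi} \overset{(a)}{\le} \frac{2L_f\|\lambda_\kappa^* - \tilde{\lambda}\|_2}{\varphi} \overset{(b)}{\le} \frac{2L_f\big[\|\lambda_\kappa^*\|_1 + \|\tilde{\lambda}\|_1\big]}{\varphi} \overset{(c)}{\le} \frac{4L_f}{\varphi}, \tag{47}$$

where step (a) holds by the Lipschitz Assumption 2, step (b) holds by the triangle inequality, and step (c) holds because an occupancy measure sums up to 1. 

Bound on $\|v_\kappa^*\|_\infty$: Since (11) is a convex program, the KKT conditions are necessary and sufficient, and can be written as 

$$\nabla_\lambda \mathcal{L}(\lambda_\kappa^*, u_\kappa^*, v_\kappa^*) = 0 \tag{48a}$$
$$h^i(\lambda_\kappa^*) \ge \kappa \quad \forall i \in [I] \tag{48b}$$
$$\sum_a (I - \gamma P_a^\top)\lambda_{\kappa,a}^* = (1-\gamma)\rho \tag{48c}$$
$$u_{\kappa,i}^*\big[h^i(\lambda_\kappa^*) - \kappa\big] = 0 \quad \forall i \in [I] \tag{48d}$$
$$u_\kappa^* \ge 0 \tag{48e}$$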


By Eq. (48a), we have for any state-action pair (s, a) 


Val(Ag)sa + SS Uni VAR' (AR) s,0 =(es= Pas) Ve = 0, (49) 
iE [I] 
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where Vyf(Aj,)s,a is the (s,a) element of Vaf(A,,) and uy; is the i” elemnt of vector ux. 
P,s is a column vector and Pys(s’) = P(s’|a,s). Given a fixed action a, denote Va f(Aj)a = . 
Walia, VaS Anan » VafAg)sal’s Vari (Aga = [Vari (Apia, Vah'(Ag)aa-+  Vahi(Ag)sal” 


and P := [Pai,-+- , Pajs|] € RS*!S!. By Eq. (49), we have 


(1B? vt = VafODa + Yo uh sVahi(nda (50) 
iE [I] 
As a result, we have 
AL ¢Ly, @ *) ify\* 
Sf ene lA) pee Fah 3 Ivafada+ So un Vah (Ap alloo = I - YP7 illo 
ie [I] 


Ce Bie ys (d) " 
> |Ivilloo — |lYP? Villoo = (1— Vll¥illeo, 
(51) 
where the step (a) holds by the Lemma 3, step (b) holds by the definition of r, g;, step (c) 
comes from the triangle inequality, and step (d) is true because each row in P” adds up to 


1. Finally, we have the bound ||v%|loo. < - + a. 


C.2 Proof of Lemma 4 
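Proof. Consider the Lagrangian in (12) and note that it is convex w.r.t. $u$ as well as $v$. The gradients of the Lagrange function w.r.t. $u$ and $v$ are given by 

$$\nabla_u \mathcal{L}(\lambda, u, v) = h(\lambda) - \kappa\mathbf{1}, \qquad \nabla_v \mathcal{L}(\lambda, u, v) = (1-\gamma)\rho + \sum_a (\gamma P_a^\top - I)\lambda_a. \tag{52}$$

It is obvious that $\nabla_u^2 \mathcal{L}(\lambda,u,v) = \nabla_{uv}\mathcal{L}(\lambda,u,v) = \nabla_{vu}\mathcal{L}(\lambda,u,v) = \nabla_v^2\mathcal{L}(\lambda,u,v) = 0$, which means that the Hessian matrix $\nabla_w^2\mathcal{L}(\lambda,u,v)$ is a zero matrix. Thus, the Lagrange function is convex w.r.t. $w$. Then, let us define $w = [u^\top, v^\top]^\top$, $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^t$, and decompose the duality gap as 

$$\begin{aligned} \mathcal{L}(\lambda_\kappa^*, \bar{u}, \bar{v}) - \mathcal{L}(\bar{\lambda}, u, v) &= \mathcal{L}(\lambda_\kappa^*, \bar{w}) - \mathcal{L}(\bar{\lambda}, w)\\ &\overset{(a)}{\le} \frac{1}{T}\sum_{t=1}^{T}\big[\mathcal{L}(\lambda_\kappa^*, w^t) - \mathcal{L}(\lambda^t, w)\big]\\ &= \frac{1}{T}\sum_{t=1}^{T}\big[\mathcal{L}(\lambda_\kappa^*, w^t) - \mathcal{L}(\lambda^t, w^t) + \mathcal{L}(\lambda^t, w^t) - \mathcal{L}(\lambda^t, w)\big]\\ &\overset{(b)}{\le} \frac{1}{T}\sum_{t=1}^{T}\big[\langle\nabla_\lambda\mathcal{L}(\lambda^t, w^t), \lambda_\kappa^* - \lambda^t\rangle + \langle\nabla_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle\big], \end{aligned} \tag{53}$$

where step (a) holds by Jensen's inequality and step (b) utilizes the convexity of $\mathcal{L}(\lambda, \cdot)$ and the concavity of $\mathcal{L}(\cdot, w)$. 

C.3 Proof of Theorem 1 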


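Proof. For ease of analysis, we collect the dual variables $u$ and $v$ in one variable $w$, as defined in Lemma 4. The next two lemmas provide the bounds on the terms I and II in Eq. (27). 


Lemma 6. Let the iterate sequence $\{\lambda^t\}$ be updated as in the updates (24) and (25) of Algorithm 1. Then for any $t$ it holds that 

$$\langle\nabla_\lambda\mathcal{L}(\lambda^t, w^t), \lambda - \lambda^t\rangle \le \frac{1}{\beta}\big[KL(\lambda\|\lambda^t) - KL(\lambda\|\lambda^{t+1})\big] + \frac{\beta}{2}\sum_{s,a}\lambda^t_{sa}(\Delta^t_{sa})^2 + \langle\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, w^t) - \nabla_\lambda\mathcal{L}(\lambda^t, w^t), \lambda^t - \lambda\rangle. \tag{54}$$

Lemma 7. Define $\mathcal{W} = \mathcal{U} \times \mathcal{V}$ and consider the iterate sequence $\{w^t\}$ updated according to the rules in Eq. (22) and (23) of Algorithm 1. For any $t$, it holds that 

$$\langle\nabla_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle \le \frac{1}{2\alpha}\big[\|w^t - w\|^2 - \|w^{t+1} - w\|^2\big] + \frac{\alpha}{2}\|\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t)\|^2 + \langle\nabla_w\mathcal{L}(\lambda^t, w^t) - \hat{\nabla}_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle. \tag{55}$$

Next, utilizing the results of Lemmas 6 and 7 (see proofs in Appendices C.4 and C.5) in Lemma 4, we prove the main result in Theorem 1, which establishes the final bound on the duality gap as follows. Let $\lambda = \lambda_\kappa^*$ in Eq. (54) and $(u^\dagger, v^\dagger) := \arg\min_{u,v}\mathcal{L}(u, v, \bar{\lambda})$ in Eq. (55). Then, summing Eq. (54) and (55) from $t = 1$ to $T$, we have 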


> [(vacia’w!), xt _ d*) i (Tw L(a'w'),w' 7 w')| 


i 
KLAN) | B OO ye at pic , ae 
a op 2a Nate +p (Mace ) — VyL(at,w'), r) 
ae A T3 
shes — wi 2. 2 tt) tt 
+ spo llw! — wl? + ZY L(At, w')| +¥ (vee L(At, w) — Vy L(t, w’), w wi) 
we Ts Ts 
(56) 
Combine the above result with the statement of Lemma. 4 to write 
7 6 
RIC(AL, 8,9) — £O,ut, vty] < STE[D). (57) 


j=l 
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We derive an upper bound on the right hand side of (57) in Appendix C.6-C.11. Following 
the results in Appendix C.6-C.11, we have 


log(|S||A]) 4000BL4L7,|S||A| 7 
Dive << TB” Mylo) S (l 7 ee E73] = 0, 
272 (58) 
BT] < wiry <iear, Elry < OLAV IS), 
(1 — 7)?T ay? VT(1—~)y 


Let 6 = oe / SAE and a = aoa the final bound for duality gap could be 
written as 


ig Ls Lnv/|S||A| log (|S||Al) | 4000L ¢ Lnv/|S||.A| log (|S||A]) 
= VJT(1—y¢ VT(1—y)yp 
, AOVISH 6h La VIS | 200L Ln VISIT 
VT(l1-y)~e VT-ye  VvE(-yV¢ 


I|S||Allog(|S||A)) Lr Ln 
<o(y T rear 


E[L(A*, a, ¥) — L(A, ul, v")] 


which is as stated in the statement of Theorem 1. 


C.4 Proof of Lemma 6 


The proof of Lemma 6 in this work follows logic similar to that of (Zhang et al., 2021)[Lemma C.2]. The main difference lies in the selection of the shift parameters $M$, and we provide the proof here for completeness. 


Proof. Let us define $\Delta^t_{sa}$ as the $(s,a)$-th component of $\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t)$. Consider the update in Eq. (24) and note that the problem is separable for each component of $\lambda$ and can be solved in closed form as follows. 
1 
B 


(Vcr, u’, v’), -X') ae oe > Nes a pe Xsa log (5) 


sa 


aoe 
Spas {Aa g lee x, ; (60) 


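where we drop the terms which do not depend upon the variable $\lambda$, and $\Delta$ denotes the set of probability distributions. Next, we solve the unconstrained maximization in (60) by differentiating and equating it to zero as follows 

$$\frac{d}{d\lambda_{sa}}\Big[\sum_{s,a}\lambda_{sa}\Delta^t_{sa} - \frac{1}{\beta}\sum_{s,a}\lambda_{sa}\log\Big(\frac{\lambda_{sa}}{\lambda^t_{sa}}\Big)\Big]\Big|_{\lambda_{sa} = \lambda^{t+\frac{1}{2}}_{sa}} = \Delta^t_{sa} - \frac{1}{\beta}\log\Big(\frac{\lambda^{t+\frac{1}{2}}_{sa}}{\lambda^t_{sa}}\Big) - \frac{1}{\beta} = 0. \tag{61}$$

After rearranging the terms, we obtain 

$$\lambda^{t+\frac{1}{2}}_{sa} = \lambda^t_{sa}\exp(\beta\Delta^t_{sa} - 1). \tag{62}$$

Now, we project the solution back onto the set of valid probability distributions and obtain the update as 

$$\lambda^{t+1}_{sa} = \frac{\lambda^t_{sa}\exp(\beta\Delta^t_{sa})}{\sum_{s',a'}\lambda^t_{s'a'}\exp(\beta\Delta^t_{s'a'})}, \tag{63}$$

where we note that $\lambda^{t+1} \in \Delta$. Next, we analyze the one-step KL divergence of $\lambda^{t+1}$ to any $\lambda$ as 

$$KL(\lambda\|\lambda^t) - KL(\lambda\|\lambda^{t+1}) = \sum_{s,a}\lambda_{sa}\log\Big(\frac{\lambda^{t+1}_{sa}}{\lambda^t_{sa}}\Big). \tag{64}$$

Next, we substitute the definition of $\lambda^{t+1}$ to obtain 

$$\begin{aligned} KL(\lambda\|\lambda^t) - KL(\lambda\|\lambda^{t+1}) &= \sum_{s,a}\lambda_{sa}\Big[\beta\Delta^t_{sa} - \log\sum_{s',a'}\lambda^t_{s'a'}\exp(\beta\Delta^t_{s'a'})\Big]\\ &= \beta\langle\lambda, \hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t)\rangle - \log\sum_{s',a'}\lambda^t_{s'a'}\exp(\beta\Delta^t_{s'a'}), \end{aligned} \tag{65}$$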
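The closed-form update in Eq. (63) is an exponentiated-gradient (multiplicative-weights) step. The sketch below is illustrative only: the function name and the numeric values are assumptions, and it covers just the normalized update shown in Eq. (63), not any further step of Algorithm 1.

```python
import math

def exponentiated_gradient_step(lam, delta, beta):
    """Update proportional to lam * exp(beta * delta), normalised to sum to 1."""
    # Subtracting the largest exponent is a standard stabilisation trick;
    # the common factor cancels when normalising.
    shift = max(beta * d for d in delta)
    weights = [l * math.exp(beta * d - shift) for l, d in zip(lam, delta)]
    total = sum(weights)
    return [w / total for w in weights]

lam = [0.25, 0.25, 0.25, 0.25]      # current distribution over (s, a) pairs
delta = [-1.0, -0.5, -2.0, -0.1]    # hypothetical non-positive gradient entries
new_lam = exponentiated_gradient_step(lam, delta, beta=0.5)
```

Entries with larger $\Delta^t_{sa}$ gain probability mass, and the output stays on the probability simplex, which is what makes the KL-based analysis in Eq. (64) and (65) applicable.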


where we utilize the fact that 5° - Asa = 1. To proceed next, recall that we have 


ae 


YUs' — Us 
Asa = 


C Val (A)s,a ar S> uiV xh'(A)s,a — Me (66) 


iE [I] 


where V /(A)(s,a) and Vyh'(A)(s,a) are the (s,a) element of V (A) and Vyh'(A), respec- 
tively. We note that 


L ALL 
jroo — al S [reel + loo] 4] + EE, (67) 
te Le 
Moreover, by Lemma 1 
; 8LrL 
IVaf(A)s,a| < Ly, and Ss" uiVyh'(A)s al S i (68) 
: ~ 
i€[I] 
Hence, with the selection M, = 4 re + | and Mz = Ly ie Lon we can conclude 


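that $\Delta^t_{sa} \le 0$. Since $\exp(x) \le 1 + x + \frac{1}{2}x^2$ for $x \le 0$, we can upper bound the second term on the right hand side of (65) as 

$$\begin{aligned} \log\sum_{s',a'}\lambda^t_{s'a'}\exp(\beta\Delta^t_{s'a'}) &\le \log\Big(\sum_{s',a'}\lambda^t_{s'a'}\big(1 + \beta\Delta^t_{s'a'} + \tfrac{1}{2}\beta^2(\Delta^t_{s'a'})^2\big)\Big)\\ &= \log\Big(1 + \beta\sum_{s',a'}\lambda^t_{s'a'}\Delta^t_{s'a'} + \frac{\beta^2}{2}\sum_{s',a'}\lambda^t_{s'a'}(\Delta^t_{s'a'})^2\Big)\\ &= \log\Big(1 + \beta\langle\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t), \lambda^t\rangle + \frac{\beta^2}{2}\sum_{s',a'}\lambda^t_{s'a'}(\Delta^t_{s'a'})^2\Big)\\ &\le \beta\langle\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t), \lambda^t\rangle + \frac{\beta^2}{2}\sum_{s',a'}\lambda^t_{s'a'}(\Delta^t_{s'a'})^2, \end{aligned} \tag{69}$$

where the last inequality holds by $\log(1 + x) \le x$ for all $x > -1$. By combining Eq. (65) and (69), we obtain 

$$KL(\lambda\|\lambda^t) - KL(\lambda\|\lambda^{t+1}) \ge \beta\langle\lambda, \hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t)\rangle - \beta\langle\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t), \lambda^t\rangle - \frac{\beta^2}{2}\sum_{s',a'}\lambda^t_{s'a'}(\Delta^t_{s'a'})^2. \tag{70}$$

Rearranging the terms and dividing both sides by $\beta$, we obtain 

$$0 \le \frac{1}{\beta}\big[KL(\lambda\|\lambda^t) - KL(\lambda\|\lambda^{t+1})\big] + \langle\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t), \lambda^t - \lambda\rangle + \frac{\beta}{2}\sum_{s',a'}\lambda^t_{s'a'}(\Delta^t_{s'a'})^2. \tag{71}$$

Adding $\langle\nabla_\lambda\mathcal{L}(\lambda^t, u^t, v^t), \lambda - \lambda^t\rangle$ on both sides gives the desired result. 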


C.5 Proof of Lemma 7 


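Proof. We can combine the update rules in Eq. (22)-(23) to obtain an update for $w \in \mathcal{W} := \mathcal{U} \times \mathcal{V}$. For any $w \in \mathcal{W}$, it holds that 

$$\begin{aligned} \|w^{t+1} - w\|^2 &= \|\Pi_{\mathcal{W}}(w^t - \alpha\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t)) - w\|^2\\ &\le \|w^t - \alpha\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t) - w\|^2\\ &= \|w^t - w\|^2 + \alpha^2\|\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t)\|^2 - 2\alpha\langle\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle\\ &= \|w^t - w\|^2 + \alpha^2\|\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t)\|^2\\ &\quad - 2\alpha\langle\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t) - \nabla_w\mathcal{L}(\lambda^t, w^t) + \nabla_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle, \end{aligned}$$

where the inequality holds by the non-expansiveness of the projection operator. The subsequent equalities hold by expanding the square and by adding and subtracting the term $2\alpha\langle\nabla_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle$. After rearranging the terms in the above expression, we obtain 

$$2\alpha\langle\nabla_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle \le \|w^t - w\|^2 - \|w^{t+1} - w\|^2 + \alpha^2\|\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t)\|^2 - 2\alpha\langle\hat{\nabla}_w\mathcal{L}(\lambda^t, w^t) - \nabla_w\mathcal{L}(\lambda^t, w^t), w^t - w\rangle. \tag{72}$$

Next, dividing both sides by $2\alpha > 0$ gives the statement of Lemma 7. 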
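The key step of this proof, non-expansiveness of the projection, can be checked numerically. The sketch below assumes a box-shaped feasible set purely for illustration (the sets $\mathcal{U}$ and $\mathcal{V}$ of Algorithm 1 are not reproduced here), and all names and values are hypothetical.

```python
import random

def project_box(x, lower, upper):
    """Euclidean projection onto the box [lower, upper]^d (clip each coordinate)."""
    return [min(max(xi, lower), upper) for xi in x]

def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def projected_step(w, grad, alpha, lower, upper):
    """One projected gradient step w <- Pi_W(w - alpha * grad)."""
    return project_box([wi - alpha * gi for wi, gi in zip(w, grad)], lower, upper)

rng = random.Random(0)
w_t = [rng.uniform(0, 1) for _ in range(5)]
w_ref = [rng.uniform(0, 1) for _ in range(5)]   # any comparison point inside the box
grad = [rng.gauss(0, 1) for _ in range(5)]      # stand-in for a stochastic gradient
alpha = 0.7
unprojected = [wi - alpha * gi for wi, gi in zip(w_t, grad)]
w_next = projected_step(w_t, grad, alpha, 0.0, 1.0)
```

For any reference point inside the set, `sq_dist(w_next, w_ref)` is no larger than `sq_dist(unprojected, w_ref)`, which is exactly the inequality used in the second line of the derivation.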
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C.6 Upper Bound for E[T;] 


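$$\mathbb{E}[T_1] = \frac{KL(\lambda_\kappa^*\|\lambda^1)}{T\beta} = \frac{1}{T\beta}\sum_{s,a}\lambda^*_{\kappa,sa}\log\Big(\frac{\lambda^*_{\kappa,sa}}{\lambda^1_{sa}}\Big) = \frac{1}{T\beta}\sum_{s,a}\lambda^*_{\kappa,sa}\big[\log\lambda^*_{\kappa,sa} - \log\lambda^1_{sa}\big] \le \frac{1}{T\beta}\sum_{s,a}\lambda^*_{\kappa,sa}\log(|S||A|) = \frac{\log(|S||A|)}{T\beta}. \tag{73}$$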


C.7 Upper Bound for E[T)] 


For any fixed u’, v’, A°, we have 
3 v3 2 


ID Man(A )-lul, v2, A] 


si — Us — M- ; ; 
= shat a x, (Tee : 1(5,a)=(se,04) ats Vas(Ay(s, a) = Ss" uiV ah (A)(s, a) ~ Ms) | 


iE [I] 


3) — Us — M : j ‘ 
Emel > Ue [a( eae Ms ; tsdeGeat) + 2( Taf(arls a) + s ujVyh'(A)(s, a) — Ms) | \ 


iE [I] 
(74) 


where in step (a), we use the inequality (a + b)? < 2a? + 2b?. Next, we perform further 


simplifications as 


PID Aiw(A oll, ve, A] 


, ; 2 
<2E e/a, " (M5 is “my J; 2 23 at (vara) (s,a) + Ss" uiVh'(A)(s, a) — Mp) 
oe i€[I] 
2 
— Us i 
= 20 Matha, (MY) + 2D Xe (Va(a)(s.4) + D2 uVan'(ay(s.a) — M6) 
St Qt ate i€ [I] 


2 
Nea, (wy — Us, — Mn) ; 2 
ae : r 2D Abn (vascar( )(s,a) + ws Vah(A)(s,0) ~ Ma) 
wa = Oe sa 


7€ [I] 


(75) 
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Next, after omitting the positive term in the denominator, we get 


ID Aea(A a) uv] 


2 
ae x (varia) (s,a) + ‘2 ujVyh'(A)(s, a) — Me) 


are (LO) NE se iqn 
So aus pra = ) (‘Slain +M}) 
_ LS{SIAlLS + 4 tale n sli ‘ Serta)" 
1-6 
(@) 128L%|S||A|(1 + 40n)? 8LF(1+ cx 
we beng \eaaepe ClO) 200 


4000L5-L;,|S]|A| 
fe Ce) ee) ss 


(76) 
Step (c) holds because we use the boundness of dual variable and Lemma 1. Step (d) holds 
since 0 < y < 1. Next, we write down the term E[T)] as 


BT] = BLE 7D Mal Aba)") 2 SEIS Nal 


OB 
=p DEE [D Aal A A te eT] (77) 


10003L312|S||A| 
(1—y)?Py? 


where step (a) holds by the linear of expectation and step (b) holds due to law of total 
expectation. The last inequality holds by 6 € (0, 5). 


C.8 Expression for E[T73] 


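For any fixed $u^t, v^t, \lambda^t$, we have 

$$\mathbb{E}\big[\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, u^t, v^t) \mid u^t, v^t, \lambda^t\big] = \nabla_\lambda\mathcal{L}(\lambda^t, u^t, v^t) - M_1\cdot\mathbf{1} - M_2\cdot\mathbf{1}. \tag{78}$$

Thus, 

$$\mathbb{E}[T_3] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\langle\hat{\nabla}_\lambda\mathcal{L}(\lambda^t, w^t) - \nabla_\lambda\mathcal{L}(\lambda^t, w^t), \lambda^t - \lambda_\kappa^*\rangle\big] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\langle -(M_1 + M_2)\cdot\mathbf{1}, \lambda^t - \lambda_\kappa^*\rangle\big] = 0, \tag{79}$$

where the last step is true because $\langle\lambda^t, \mathbf{1}\rangle = \langle\lambda_\kappa^*, \mathbf{1}\rangle = 1$. 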
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C.9 Upper Bound for E[T] 


For any uc U 


25654 


lJut — ul]? < |u|? + lull? + 2] (ut, a) | < |u|? + full? + 2|/u"|||/ull < 2 F (80) 


where the last inequality holds by ||x||2 < ||x||1 for any x and the definition of U/. Similarly, 
for any ve V 
llvt = vl? < wl? + [Iv PF + lly illlv'll < 1SIlv ll + IIvllZ. + 2I1V" [lool ¥‘lloo) 
< 16|S|| sca al a HOISTS, eu 
= lay = ser aay? 


Finally, combine above two inequalities, 


1 1 400|S|L4L? 
eIT,) = A! = 2_ dat) 2 Teele) cet i ets he ee 89 
[M4] = sayllw — will =opoille — all’ tiv —v'll s (17) Tay? (82) 
C.10 Upper Bound for E[{T5] 
For any fixed u’, v’, \°, we have 
af Pact ul v})IP|al,v',a'] = [Inia at? < QIAN? Aer sar (88) 


where the last step holds because |h’(A)| < 1,Vi € [I] by the Lemma 2 and the fact 
O<K <1. 


Asrat (yes a es,) 2 


Caran 


sf cia ul, v9? 


ulviay = 1s 4,01,8) 80 ia co eso as 


u,v’, x 


(a) Neg (yes — @s,) 
= 5, ,a1,84,80 iia — 7)€sy + (1 “Ox + 7 Pyulvi a 
Stat Ss 
(0) ‘ Asta (ye r) r (e ) 
S Esearsh [sc — Yel? Saye + all 
Stat Stat 
377 +3 6 
2 | 
SBorashan 1-9 + age] S84 Ta 
(84) 


where step (a) holds by using the definition of ¢,,q, in the algorithm. Step (b) comes from 
the Cauchy-Schwartz inequality. Combined Eq. (83), (84) with the definition of w, 


ih T 

“ a oye a ee F 

E(TS] = ap DL EIIVwLO' wid? = oe fEIM.cOatu.vIP + E|Val(rtut,v) |? 
t=1 t=1 


a 6 
aA 
5 (3+ aoa uy < l6al 


(85) 
where the last step holds by 6 € (0, 5) 
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C.11 Upper Bound for E[{T¢| 


First, notice that $T_6$ is different from $T_3$ because $w^\dagger$ depends on $\bar{\lambda}$, which is a random variable, whereas $\lambda_\kappa^*$ depends only on $\kappa$, which is a constant. Thus, in order to bound $T_6$, we need the following lemma. 


Lemma 8 ((Beck, 2017)). Let Z C R¢@ be a convex set andw : Z — R be a1-strongly convex 
function with respect to norm ||-|| over Z. With the assumption that for alla € Z we have 
w(x) —mingeg w(x) < $D?, then for any martingale difference sequence {Z;,}/_, € R4 and 
any random vector x € Z, it holds that 


K pi& 
[3 anm | sy] (86) 
where || - ||x denotes the dual norm of || - || 
For any fixed $u^t, v^t, \lambda^t$, the gradient estimation is unbiased:
$$\mathbb{E}[\hat{\nabla}_\theta L(\lambda^t, u^t, v^t)] = \nabla_\theta L(\lambda^t, u^t, v^t) \tag{87}$$
where $\theta = u$ or $v$. Thus,
tae : 
E[Te] = = > 2 (Vw L(', w') — VwL(At, w’), we — w') | 
1 r A 
=3 ~ 2 | (Vw L(A! w') . Vw L(A‘, w"), w! ) | (88) 


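To apply Lemma 8, let $Z = W$, $\omega(x) = \frac{1}{2}\|x\|^2$, $x = w^\dagger$ and $Z_t = \hat{\nabla}_w L(w^t, \lambda^t) - \nabla_w L(w^t, \lambda^t)$, which is a martingale difference. Then, $\omega(x) - \min_{x \in Z} \omega(x) = \omega(w) = \frac{1}{2}\|w\|^2 \le \frac{1}{2}D^2$ and thus $D \ge \|w\|$. The norm of $w$ can be bounded as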
8L rs AT Bee 
|? = [fal]? + Iv? < llullf + ISIlivl2. = (4)? + 218| F Sa | 
" y GU- )¢ (89) 


25614 | 2|S|L4 | 16|S|L4Lj, , 32|S|L4L;, 324/5|L7L 5, 
Ges el ae < Cheep. lena oe laa)? 


Thus, ||w]| < aes =: D. Apply Lemma 8 to Eq. (88), 


18L¢L Ree hs e 
pire] < “SEsEAVIST |S rye eh lwt, AY) — Vk (wt, d°)|) 


VT(1—)¢ VT(1— 9 
Appendix D. Proofs for Section 5.2
D.1 Proof of Lemma 5 


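Proof. Recall that $\lambda^*$ is the optimal occupancy measure for the original problem, which gives
$$h^i(\lambda^*) \ge 0 \tag{91}$$

Further, under the Slater condition (Assumption 3), there exists at least one occupancy measure $\hat{\lambda}$ such that
$$h^i(\hat{\lambda}) \ge \varphi \tag{92}$$

Define a new occupancy measure $\lambda' = \left(1 - \frac{\kappa}{\varphi}\right)\lambda^* + \frac{\kappa}{\varphi}\hat{\lambda}$. By the concavity of the constraint functions $h^i$, it can be shown to be a feasible occupancy measure for the conservative problem:
$$h^i(\lambda') = h^i\left(\left(1 - \tfrac{\kappa}{\varphi}\right)\lambda^* + \tfrac{\kappa}{\varphi}\hat{\lambda}\right) \ge \left(1 - \tfrac{\kappa}{\varphi}\right)h^i(\lambda^*) + \tfrac{\kappa}{\varphi}h^i(\hat{\lambda}) \ge \tfrac{\kappa}{\varphi}\,\varphi = \kappa \tag{93}$$
$$\sum_a (I - \gamma P_a^T)\lambda'_a = \left(1 - \tfrac{\kappa}{\varphi}\right)\sum_a (I - \gamma P_a^T)\lambda^*_a + \tfrac{\kappa}{\varphi}\sum_a (I - \gamma P_a^T)\hat{\lambda}_a = (1 - \gamma)\rho \tag{94}$$

Then, we can bound the difference
$$f(\lambda^*) - f(\lambda^*_\kappa) \overset{(a)}{\le} f(\lambda^*) - f(\lambda') = f(\lambda^*) - f\left(\left(1 - \tfrac{\kappa}{\varphi}\right)\lambda^* + \tfrac{\kappa}{\varphi}\hat{\lambda}\right) \le \tfrac{\kappa}{\varphi}f(\lambda^*) - \tfrac{\kappa}{\varphi}f(\hat{\lambda}) \overset{(b)}{\le} \tfrac{\kappa}{\varphi}f(\lambda^*) \overset{(c)}{\le} \tfrac{\kappa}{\varphi} \tag{95}$$

The first step (a) holds because $\lambda^*_\kappa$ is the optimal solution of the conservative problem, which attains a larger objective value than any other feasible occupancy measure. The unlabeled inequality uses the concavity of $f$. We drop the negative term in step (b), and the last step (c) holds because $f(\lambda^*) \le 1$ by Lemma 2.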
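The feasibility argument in the proof of Lemma 5 can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: the two-entry vectors, the concave function `h`, and the values of `kappa` and `phi` are hypothetical, and the sum-to-one condition stands in for the linear flow constraint, which any convex combination also preserves.

```python
import math

def mix(lam_star, lam_hat, kappa, phi):
    """Entrywise convex combination (1 - kappa/phi) * lam_star + (kappa/phi) * lam_hat."""
    w = kappa / phi
    return [(1.0 - w) * a + w * b for a, b in zip(lam_star, lam_hat)]

def h(lam):
    """Hypothetical concave constraint function, chosen only for illustration."""
    return math.sqrt(lam[0]) - 0.5

kappa, phi = 0.1, 0.5          # assumed slack and Slater constant
lam_star = [0.25, 0.75]        # h(lam_star) = 0: feasible for the original problem
lam_hat = [1.0, 0.0]           # h(lam_hat) = 0.5 = phi: a Slater point
lam_mix = mix(lam_star, lam_hat, kappa, phi)

# Concavity gives h(lam_mix) >= (1 - kappa/phi) * 0 + (kappa/phi) * phi = kappa.
print(h(lam_mix) >= kappa, abs(sum(lam_mix) - 1.0) < 1e-12)
```

The mixing weight $\kappa/\varphi$ is the smallest weight on the Slater point that guarantees the margin $\kappa$ when the original optimum sits exactly on the constraint boundary.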


D.2 Proof of Theorem 2 


Proof. To relate the duality gap to the result in the occupancy measure space, let us consider the expression for the Lagrangian function. By the feasibility of $\lambda^*_\kappa$, we can write


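$$L(\lambda^*_\kappa, u^\dagger, v^\dagger) = f(\lambda^*_\kappa) + \langle u^\dagger, h(\lambda^*_\kappa) - \kappa\mathbf{1} \rangle + \Big[\sum_a (\lambda^*_{\kappa,a})^T(\gamma P_a - I) + (1 - \gamma)\rho^T\Big] v^\dagger \ge f(\lambda^*_\kappa). \tag{96}$$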


Define the set I = {i|h'(A) < 0}. Denote uw’ = [u},uh,---,ui]7, where ui = u; if i e I 


fe <s ay, Ses 
and ui, = 0 otherwise. Define C; := % and Co = 7 + (—4)¢ 


for simplicity, which is the 
bound for $\|u^*\|_1$ and $\|v^*\|_\infty$, respectively. By the definition of $u^\dagger, v^\dagger$,


ucl,vEeV 


L(A,ul,v')= min f(A)+ (u, h(A) — K) + [ Aa)" OP. —-TI)-(1- vol v 


a 


= min 
u’/CU,vEV 

(97) 
where the notation $x_- := \min\{x, 0\}$ and the equality holds because $u^\dagger_i = 0$, $i \in \mathcal{I}^c$, for those constraints which are satisfied. Let us consider the second term on the right-hand side of the above expression as follows


(u’, [h(A) — K]—) <|ju'|[al|[a(A) — 61]-|lo0 
<2C;|[h(A) — £1]— loo. (98) 
Notice that equality in the above inequality is achievable by selecting u; = 2C, for 7 = 
argmax; |h’(A)—#| and ul = 0 fork # j. Such ul gives the minimum of (u’, [h(A) — K]_) = 
2C;||[h(A) — K1]_ ||. Similarly, vi = 2C21 gives the minimum of [E.On OP, -T- 


(_l- no| V = 2C4|| 37, (Aa)? (yPa — I) — (1 — y)pl|l by Holder inequality. Hence, we could 


write the expression in (97) as 
L(A, ul, vl) = (A,r) = ||[h(A) = £I]-|]o0 — 203] S "(Aa)" (YPa —I)-(1—-y)p|ls. (99) 
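The Hölder step used around Eq. (98) can be illustrated with a short sketch. It assumes a hypothetical slack vector and an $\ell_1$ budget standing in for $2C_1$, and it shows that placing the whole budget on the coordinate with the largest violation attains the extreme value of the inner product, whose magnitude is the budget times the sup-norm of the negative part.

```python
def neg_part(x):
    """Entrywise negative part [x]_- = min(x, 0)."""
    return [min(v, 0.0) for v in x]

def inner(u, x):
    """Standard inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, x))

def holder_minimizer(x_minus, budget):
    """Put the full l1 budget on the coordinate with the largest |x_minus|."""
    j = max(range(len(x_minus)), key=lambda i: abs(x_minus[i]))
    u = [0.0] * len(x_minus)
    u[j] = budget
    return u

slack = [0.3, -0.2, -0.7]      # hypothetical values of h^i(lambda) - kappa
budget = 4.0                   # hypothetical value standing in for 2 * C_1
x_minus = neg_part(slack)
u_best = holder_minimizer(x_minus, budget)
min_value = inner(u_best, x_minus)
sup_norm = max(abs(v) for v in x_minus)

# For nonnegative u with l1 norm at most `budget`, <u, x_minus> >= -budget * sup_norm.
print(abs(min_value + budget * sup_norm) < 1e-12)
```

Only violated constraints contribute, since the negative part zeroes out every coordinate whose constraint is satisfied.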


Combining Eq. (99) with (96) and then taking expectation, we obtain 


BIC(A,, u,v’) —L(A, ul, v!)] > | 70%) F(A) +|[[B(A) #1] |loo+2C2|| SAG)" (9Pa-T) +1 yelhi f. 


(100) 
Combining with the result in Theorem 1, there exists a constant ¢, such that 
E [7ian) — f(A) + |I[h(A) = £1] Iloo + 2C2l| SOAS)" (yPa — 1) + (1 - volh 
~ ( [ASIA os(ISAl  — LrLn ) 
<¢ | . : 101 
( T Lay oe 


Denote DL := ( / JIS||Al loa (|S 1A) : ee). By the Theorem 4 (see Appendix F for refer- 


ence), we directly get 


E[F(An) — f(A)] < ZL, (102a) 
= 2L Ly 
— KI]_|loo < = === < Ly, 
I[h(A) — 61]-lleo SG = 5 T Le (102b) 
T 2L 2L (1—y)Le 
E| S*(yP7 - DA. + (1- Vell, <=> aa (102c) 
Ta Nay) 
Note that the result in (102a) is stated relative to $\lambda^*_\kappa$. To obtain the result for $\lambda^*$, we use the statement of Lemma 5 and write


ELS (A*) — FAY] = ELP(A*) — FAR) + E[F(AR) — £(A)] S . +L, (103) 


where we have utilized the upper bound developed in Lemma 5. Next, recall that 


fe 2a ( [751A ceSTAD, byte) 


and from the definition of L, we can write 


(104) 


which establishes the upper bound for the optimality gap for the original optimization problem. Further, from the result in (102b), we have for all $i \in [I]$


E|[h'(A) — w]-| < Le. (105) 


Note that by the definition $[x]_- := \min\{x, 0\}$, it holds that $|[x]_-| = -\min\{x, 0\}$, because $\min\{x, 0\}$ is either zero or negative. Therefore, it holds that $|[h^i(\tilde{\lambda}) - \kappa]_-| = -[h^i(\tilde{\lambda}) - \kappa]_-$ and thus


E([h'(A) — KJ_) > —Ly. (106) 


Further, since [x]_ is a concave function with respect to x, via Jensen’s inequality, we can 
write 


[B[A'(R) — w]]_ > E((hi(A) — ]_) > —Le. (107) 
Again, by the definition of $[x]_-$, we simplify (107) to
min{E[h'(A)] — «,0} > —Ly. (108) 
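The Jensen step in (107) relies on $[x]_-$ being concave, so the negative part of a mean is at least the mean of the negative parts. A minimal numerical check, using hypothetical sample values in place of realizations of $h^i(\tilde{\lambda}) - \kappa$, could look like this:

```python
def neg_part(x):
    """[x]_- = min(x, 0), a concave function of x."""
    return min(x, 0.0)

def mean(values):
    """Arithmetic mean, standing in for the expectation."""
    return sum(values) / len(values)

# Hypothetical realizations of h^i(lambda) - kappa across runs.
samples = [0.4, -0.3, 0.1, -0.6]

neg_of_mean = neg_part(mean(samples))                 # [E x]_-
mean_of_neg = mean([neg_part(s) for s in samples])    # E [x]_-

# Jensen's inequality for a concave function: [E x]_- >= E [x]_-.
print(neg_of_mean >= mean_of_neg)
```

The gap between the two quantities comes from samples with positive slack, which offset violations inside the mean but are clipped to zero when the negative part is taken first.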


Thus, we obtain either E[h’(A)] > « > 0 or E[h'(A)] > «6 — Ly. The first case is trivial and 


for the second case, recall « = 2¢1 (v “sill los (ISA) . ‘ts 


pee % I|S||A|log(|S||A]) LL 
snt(R)] 2 n— Lp = &( yf PUAL oat SIAD | Zen) (109) 
272 fe 
Let T= @ aaa ol By Eq. (102), we have the final result 
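$$\mathbb{E}[f(\lambda^*) - f(\tilde{\lambda})] \le 3\epsilon \tag{110a}$$
$$\mathbb{E}[h^i(\tilde{\lambda})] \ge \epsilon \quad \forall i \in [I] \tag{110b}$$
$$\mathbb{E}\Big\|\sum_a (\gamma P_a^T - I)\tilde{\lambda}_a + (1 - \gamma)\rho\Big\|_1 \le \frac{(1 - \gamma)\epsilon}{L_f} \tag{110c}$$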
Recall that it is required « < min{$,1}, which gives 
L2.L?1|S||A|log(|S||A 
p> 4aQte |S|| A] log (|S]| UD sae) (411) 


(hee 
D.3 Proof of Corollary 1
Proof. Under the condition that $\kappa = 0$, it is obvious that $\lambda^*_\kappa = \lambda^*$. Thus, we have


x ~ ( [ASIA og(ISAl)  — LrLn ) 
E| f(A*) — f(A)] < b=eé i . 112 
0) - FO) sb =a( ° Fes (112) 
Furthermore, similar to Eq. (109) 
os I|S||A| log(|S LL 
T 1-y 
272 oO 
Let T = a 2 ee we derive the following result 
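$$\mathbb{E}[f(\lambda^*) - f(\tilde{\lambda})] \le \epsilon \tag{114a}$$
$$\mathbb{E}[h^i(\tilde{\lambda})] \ge -\epsilon \quad \forall i \in [I] \tag{114b}$$
$$\mathbb{E}\Big\|\sum_a (\gamma P_a^T - I)\tilde{\lambda}_a + (1 - \gamma)\rho\Big\|_1 \le \frac{(1 - \gamma)\epsilon}{L_f} \tag{114c}$$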
Appendix E. Proofs for Section 5.3 
E.1 Proof of Theorem 3 
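Proof. By the result in Eq. (30b) and the definition of $\|\cdot\|_1$, we have
$$\mathbb{E}\Big[\sum_s \Big|\sum_a \tilde{\lambda}_{sa} - \gamma\sum_{s',a'} P_{a'}(s', s)\tilde{\lambda}_{s'a'} - (1 - \gamma)\rho_s\Big|\Big] \le \frac{(1 - \gamma)\epsilon}{L_f}. \tag{115}$$
For each $s \in S$, let us define $\epsilon_s$ through
$$\Big|\sum_a \tilde{\lambda}_{sa} - \gamma\sum_{s',a'} P_{a'}(s', s)\tilde{\lambda}_{s'a'} - (1 - \gamma)\rho_s\Big| = (1 - \gamma)\epsilon_s. \tag{116}$$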


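We notice that the left-hand side of Eq. (116) gives the physical meaning of the occupancy measure, which can be seen in the following Eq. (117)-(121). Furthermore, notice that $\epsilon_s$ is a random variable. It is obvious that $\epsilon_s \ge 0$ and $\mathbb{E}[\sum_s \epsilon_s] \le \frac{\epsilon}{L_f}$ by Eq. (115). Then, define the policy induced by $\tilde{\lambda}$ as $\tilde{\pi}(a|s) = \frac{\tilde{\lambda}_{sa}}{\sum_{a'}\tilde{\lambda}_{sa'}} \ge 0$. Multiplying both sides of Eq. (116) by $\tilde{\pi}(a|s)$, we obtain
$$\Big|\tilde{\lambda}_{sa} - \gamma\sum_{s',a'} P_{a'}(s', s)\tilde{\pi}(a|s)\tilde{\lambda}_{s'a'} - (1 - \gamma)\rho_s\tilde{\pi}(a|s)\Big| = (1 - \gamma)\epsilon_s\tilde{\pi}(a|s), \quad \forall a \in A, s \in S. \tag{117}$$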
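The policy induced by an occupancy measure, as used in the step leading to Eq. (117), is a row-wise normalization. The sketch below assumes the measure is stored as a nested list `lam[s][a]`; the function name and the uniform fallback for states with zero total mass are illustrative assumptions, not part of the original construction.

```python
def induced_policy(lam):
    """Return pi[s][a] = lam[s][a] / sum over a' of lam[s][a'] for a nonnegative table lam."""
    policy = []
    for row in lam:
        total = sum(row)
        if total > 0.0:
            policy.append([x / total for x in row])
        else:
            # Assumption: fall back to a uniform distribution when a state has no mass.
            policy.append([1.0 / len(row)] * len(row))
    return policy

# Hypothetical occupancy measure over 3 states and 2 actions.
lam = [[0.1, 0.3], [0.0, 0.6], [0.0, 0.0]]
pi = induced_policy(lam)

# Each row of pi is a probability distribution over actions.
print(all(abs(sum(row) - 1.0) < 1e-12 for row in pi))
```

With this definition, multiplying the per-state mass $\sum_a \tilde{\lambda}_{sa}$ by $\tilde{\pi}(a|s)$ recovers $\tilde{\lambda}_{sa}$, which is the identity the derivation relies on.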
Now define $\rho_{sa} = \rho_s\tilde{\pi}(a|s)$, which can be considered as the initial distribution for state and action following policy $\tilde{\pi}$. Define $P_{\tilde{\pi}}(s, a, s', a') = P_a(s, s') \cdot \tilde{\pi}(a'|s')$, which can be considered as the transition matrix from the current state-action pair $(s, a)$ to the next state-action pair $(s', a')$. Furthermore, define $\epsilon_{sa} = \epsilon_s\tilde{\pi}(a|s)$, and it is obvious that $\sum_a \epsilon_{sa} = \epsilon_s$. Then, Eq. (117) can be simplified as
$$\Big|\tilde{\lambda}_{sa} - \gamma\sum_{s',a'} P_{\tilde{\pi}}(s', a', s, a)\tilde{\lambda}_{s'a'} - (1 - \gamma)\rho_{sa}\Big| = (1 - \gamma)\epsilon_{sa}, \quad \forall a \in A, s \in S. \tag{118}$$


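With a little abuse of the notation $\pm$, we can write
$$\tilde{\lambda}_{sa} - \gamma\sum_{s',a'} P_{\tilde{\pi}}(s', a', s, a)\tilde{\lambda}_{s'a'} = (1 - \gamma)(\rho_{sa} \pm \epsilon_{sa}), \quad \forall a \in A, s \in S, \tag{119}$$
where $\pm$ means the left-hand side can be equal to $(1 - \gamma)(\rho_{sa} + \epsilon_{sa})$ or $(1 - \gamma)(\rho_{sa} - \epsilon_{sa})$. Next, define $\boldsymbol{\rho} \in \mathbb{R}^{|S||A|} = [\rho_{s_1a_1}, \dots, \rho_{s_{|S|}a_{|A|}}]^T$ as a vector, define $\hat{\boldsymbol{\epsilon}} \in \mathbb{R}^{|S||A|} = [\epsilon_{s_1a_1}, \dots, \epsilon_{s_{|S|}a_{|A|}}]^T$ as a vector, and define $P_{\tilde{\pi}} \in \mathbb{R}^{|S||A| \times |S||A|}$ as a matrix. Then, we could write the expression in Eq. (119) in the following compact form
$$\tilde{\lambda} - \gamma P_{\tilde{\pi}}^T\tilde{\lambda} = (1 - \gamma)(\boldsymbol{\rho} \pm \hat{\boldsymbol{\epsilon}}) \tag{120}$$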


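Notice that $\|P_{\tilde{\pi}}^T\|_1 = \max_j \sum_i |P_{\tilde{\pi}}^T(i, j)| = 1$ and thus $\|\gamma P_{\tilde{\pi}}^T\|_1 \le \gamma$. This means $(I - \gamma P_{\tilde{\pi}}^T)$ is invertible and $(I - \gamma P_{\tilde{\pi}}^T)^{-1} = \sum_{i=0}^{\infty} \gamma^i (P_{\tilde{\pi}}^T)^i$. Thus, we have
$$\tilde{\lambda} = (1 - \gamma)(I - \gamma P_{\tilde{\pi}}^T)^{-1}(\boldsymbol{\rho} \pm \hat{\boldsymbol{\epsilon}}). \tag{121}$$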
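The Neumann series behind Eq. (121) can be verified on a small example. The sketch below is an illustration under assumptions: `M` is a hypothetical 2-by-2 column-stochastic matrix playing the role of $P_{\tilde{\pi}}^T$, `rho` is a hypothetical initial distribution, and the infinite series is truncated after a fixed number of terms.

```python
def matvec(M, v):
    """Matrix-vector product for a square matrix stored as nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def neumann_apply(M, b, gamma, terms=500):
    """Approximate (I - gamma*M)^{-1} b by the partial sum of gamma^i M^i b."""
    total = list(b)
    term = list(b)
    for _ in range(terms):
        term = [gamma * t for t in matvec(M, term)]
        total = [x + y for x, y in zip(total, term)]
    return total

gamma = 0.9
M = [[0.7, 0.4], [0.3, 0.6]]   # hypothetical; each column sums to 1
rho = [0.5, 0.5]               # hypothetical initial distribution

x = neumann_apply(M, rho, gamma)
lam = [(1.0 - gamma) * xi for xi in x]   # (1 - gamma) (I - gamma M)^{-1} rho

# x should solve (I - gamma*M) x = rho, and lam should be a probability vector.
residual = [xi - gamma * mi - bi for xi, mi, bi in zip(x, matvec(M, x), rho)]
print(max(abs(r) for r in residual) < 1e-9, abs(sum(lam) - 1.0) < 1e-9)
```

Because each column of `M` sums to one, every term of the series preserves total mass, which is why the factor $(1 - \gamma)$ normalizes the result to a valid occupancy measure.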
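Rearranging terms, we have
$$\tilde{\lambda} - (1 - \gamma)(I - \gamma P_{\tilde{\pi}}^T)^{-1}\boldsymbol{\rho} = \pm(1 - \gamma)(I - \gamma P_{\tilde{\pi}}^T)^{-1}\hat{\boldsymbol{\epsilon}} \tag{122}$$
Notice that
$$(1 - \gamma)(I - \gamma P_{\tilde{\pi}}^T)^{-1}\boldsymbol{\rho} = (1 - \gamma)\left[\boldsymbol{\rho} + \gamma P_{\tilde{\pi}}^T\boldsymbol{\rho} + \gamma^2 (P_{\tilde{\pi}}^T)^2\boldsymbol{\rho} + \cdots\right] = \lambda^{\tilde{\pi}} \tag{123}$$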


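Combining Eq. (122) and (123), the difference in the objective can be bounded by
$$\begin{aligned}
\mathbb{E}|f(\tilde{\lambda}) - f(\lambda^{\tilde{\pi}})| &\overset{(a)}{\le} L_f\,\mathbb{E}\|\tilde{\lambda} - \lambda^{\tilde{\pi}}\|_2 \\
&= L_f(1 - \gamma)\,\mathbb{E}\|(I - \gamma P_{\tilde{\pi}}^T)^{-1}\hat{\boldsymbol{\epsilon}}\|_2 \\
&\overset{(b)}{\le} L_f(1 - \gamma)\,\mathbb{E}\|(I - \gamma P_{\tilde{\pi}}^T)^{-1}\hat{\boldsymbol{\epsilon}}\|_1 \\
&\overset{(c)}{\le} L_f(1 - \gamma)\,\|(I - \gamma P_{\tilde{\pi}}^T)^{-1}\|_1\,\mathbb{E}\|\hat{\boldsymbol{\epsilon}}\|_1 \\
&\overset{(d)}{\le} (1 - \gamma)\sum_{i=0}^{\infty} \gamma^i \|(P_{\tilde{\pi}}^T)^i\|_1\,\epsilon \\
&\overset{(e)}{\le} (1 - \gamma)\sum_{i=0}^{\infty} \gamma^i\,\epsilon = \epsilon,
\end{aligned} \tag{124}$$
where step (a) holds by the Lipschitz assumption (Assumption 2), step (b) holds by the norm inequality $\|x\|_2 \le \|x\|_1$, step (c) holds by the definition of the matrix norm, and step (d) holds by the triangle inequality and $\mathbb{E}\|\hat{\boldsymbol{\epsilon}}\|_1 = \mathbb{E}[\sum_s \epsilon_s] \le \frac{\epsilon}{L_f}$. The last step (e) is true because $\|P_{\tilde{\pi}}^T\|_1 = 1$. Finally, we get the result
$$\mathbb{E}|f(\tilde{\lambda}) - f(\lambda^{\tilde{\pi}})| \le \epsilon. \tag{125}$$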
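Recall $\mathbb{E}[f(\lambda^*) - f(\tilde{\lambda})] \le 3\epsilon$ in Eq. (31), hence we can write
$$f(\lambda^*) - \mathbb{E}[f(\lambda^{\tilde{\pi}})] = \left(f(\lambda^*) - \mathbb{E}[f(\tilde{\lambda})]\right) + \mathbb{E}[f(\tilde{\lambda}) - f(\lambda^{\tilde{\pi}})] \le 4\epsilon, \tag{126}$$
which is for the objective suboptimality gap in the primal domain. Rescaling $\epsilon$ to $\frac{\epsilon}{4}$ finishes the proof. Similarly, for the constraints in the primal domain, we could write
$$\mathbb{E}[h^i(\lambda^{\tilde{\pi}}) - h^i(\tilde{\lambda})] \ge -\epsilon. \tag{127}$$
From the result in Eq. (30a), note that we have $\mathbb{E}[h^i(\tilde{\lambda})] \ge \epsilon$. Hence, after rearranging the terms in (127), we obtain
$$\mathbb{E}[h^i(\lambda^{\tilde{\pi}})] \ge -\epsilon + \mathbb{E}[h^i(\tilde{\lambda})] \ge -\epsilon + \epsilon = 0. \tag{128}$$
Hence proved.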


E.2 Proof of Corollary 2 
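Proof. Recalling the results in Eq. (31) and (125), we directly have
$$f(\lambda^*) - \mathbb{E}[f(\lambda^{\tilde{\pi}})] \le 2\epsilon \tag{129}$$
Similarly, combining Eq. (30a) and (127), we have
$$\mathbb{E}[h^i(\lambda^{\tilde{\pi}})] \ge -2\epsilon \tag{130}$$
Rescaling $\epsilon$ to $\frac{\epsilon}{2}$ finishes the proof.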


Appendix F. Optimization Theory

Consider the standard optimization problem
$$f_{opt} = \min\{f(x) : g(x) \le 0,\ Ax + b = 0\} \tag{131}$$
where $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, $x \in \mathbb{R}^n$ and $g : \mathbb{R}^n \to \mathbb{R}^m$. Define the value function as
$$p(u, t) = \min\{f(x) : g(x) \le u,\ Ax + b = t\} \tag{132}$$
and the dual function as
$$q(y, z) = \min_x\{f(x) + y^T g(x) + z^T(Ax + b)\}, \quad y \in \mathbb{R}^m_+,\ z \in \mathbb{R}^p \tag{133}$$
Then the dual problem can be written as
$$q_{opt} = \max_{y \in \mathbb{R}^m_+,\ z \in \mathbb{R}^p} q(y, z) \tag{134}$$
Lemma 9. (Theorem 8.59 in (Beck, 2017)) $(y, z)$ is an optimal solution of problem Eq. (134) if and only if $-(y, z) \in \partial p(0, 0)$.


Theorem 4. (Theorem 3.60 in (Beck, 2017)) Let $f, g$ be convex functions, $X$ a nonempty convex set, $A \in \mathbb{R}^{p \times n}$ and $b \in \mathbb{R}^p$. Let $f_{opt}, q_{opt}$ be the optimal values of the primal and dual problems Eq. (131) and (134), respectively. Suppose that $f_{opt} = q_{opt}$ and that the optimal set of the dual problem is nonempty. Let $(y^*, z^*)$ be an optimal solution of the dual problem. Assume that $\tilde{x} \in X$ satisfies
$$f(\tilde{x}) - f_{opt} + C_1\|[g(\tilde{x})]_+\|_\infty + C_2\|A\tilde{x} + b\|_1 \le \delta \tag{135}$$
where $\delta > 0$ and $C_1, C_2$ are constants satisfying $C_1 \ge 2\|y^*\|_1$, $C_2 \ge 2\|z^*\|_\infty$. Then
$$f(\tilde{x}) - f_{opt} \le \delta, \qquad \|[g(\tilde{x})]_+\|_\infty \le \frac{2\delta}{C_1}, \qquad \|A\tilde{x} + b\|_1 \le \frac{2\delta}{C_2} \tag{136}$$


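Proof. It is trivial that $f(\tilde{x}) - f_{opt} \le \delta$ due to the fact that $C_1\|[g(\tilde{x})]_+\|_\infty$ and $C_2\|A\tilde{x} + b\|_1$ are both non-negative. Since $(y^*, z^*)$ is an optimal solution of the dual problem, it follows by Lemma 9 that $-(y^*, z^*) \in \partial p(0, 0)$. Therefore, for any $(u, t) \in \mathrm{dom}(p)$,
$$p(u, t) - p(0, 0) \ge \langle -y^*, u \rangle + \langle -z^*, t \rangle \tag{137}$$
Plugging $u = \tilde{u} := [g(\tilde{x})]_+$ and $t = \tilde{t} := A\tilde{x} + b$ into Eq. (137), while using the inequality $p(\tilde{u}, \tilde{t}) \le f(\tilde{x})$ and the equality $p(0, 0) = f_{opt}$, we obtain
$$\begin{aligned}
(C_1 - \|y^*\|_1)\|\tilde{u}\|_\infty + (C_2 - \|z^*\|_\infty)\|\tilde{t}\|_1 &= -\|y^*\|_1\|\tilde{u}\|_\infty - \|z^*\|_\infty\|\tilde{t}\|_1 + C_1\|\tilde{u}\|_\infty + C_2\|\tilde{t}\|_1 \\
&\le \langle -y^*, \tilde{u} \rangle + \langle -z^*, \tilde{t} \rangle + C_1\|\tilde{u}\|_\infty + C_2\|\tilde{t}\|_1 \\
&\le p(\tilde{u}, \tilde{t}) - p(0, 0) + C_1\|\tilde{u}\|_\infty + C_2\|\tilde{t}\|_1 \\
&\le f(\tilde{x}) - f_{opt} + C_1\|\tilde{u}\|_\infty + C_2\|\tilde{t}\|_1 \\
&\le \delta
\end{aligned} \tag{138}$$
It is clear that $C_1 - \|y^*\|_1$ and $C_2 - \|z^*\|_\infty$ are both non-negative. Thus,
$$(C_1 - \|y^*\|_1)\|\tilde{u}\|_\infty \le \delta, \qquad (C_2 - \|z^*\|_\infty)\|\tilde{t}\|_1 \le \delta \tag{139}$$
Finally, using the assumption $C_1 \ge 2\|y^*\|_1$, $C_2 \ge 2\|z^*\|_\infty$,
$$\|[g(\tilde{x})]_+\|_\infty = \|\tilde{u}\|_\infty \le \frac{\delta}{C_1 - \|y^*\|_1} \le \frac{2\delta}{C_1}, \qquad \|A\tilde{x} + b\|_1 = \|\tilde{t}\|_1 \le \frac{\delta}{C_2 - \|z^*\|_\infty} \le \frac{2\delta}{C_2} \tag{140}$$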
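To make the statement of Theorem 4 concrete, the following sketch checks it on a one-dimensional toy problem chosen purely for illustration: minimize $x^2$ subject to $1 - x \le 0$, with no equality constraint. For this assumed problem, $f_{opt} = 1$, the dual function is $q(y) = y - y^2/4$ (obtained by minimizing over $x$ in closed form), and the dual optimum is $y^* = 2$, so $C_1 = 4$ satisfies $C_1 \ge 2\|y^*\|_1$.

```python
def f(x):
    """Toy convex objective."""
    return x * x

def g_plus(x):
    """Positive part of the toy constraint g(x) = 1 - x."""
    return max(1.0 - x, 0.0)

def q(y):
    """Dual function of the toy problem: min over x of x^2 + y * (1 - x)."""
    return y - y * y / 4.0

F_OPT = 1.0            # primal optimum, attained at x = 1
Y_STAR = 2.0           # maximizer of q, with q(Y_STAR) = F_OPT
C1 = 2.0 * Y_STAR      # smallest constant allowed by the theorem

def smallest_delta(x):
    """Left-hand side of (135) for this toy problem."""
    return f(x) - F_OPT + C1 * g_plus(x)

grid = [0.5 + 0.01 * k for k in range(101)]
checks = []
for x in grid:
    delta = smallest_delta(x)
    ok_objective = f(x) - F_OPT <= delta + 1e-12
    ok_constraint = g_plus(x) <= 2.0 * delta / C1 + 1e-12
    checks.append(ok_objective and ok_constraint)

print(all(checks))
```

The penalty weight $C_1$ must dominate the dual variable by a factor of two: with $C_1 = \|y^*\|_1$ the left-hand side of (135) could stay small while the constraint violation does not, so the bound $2\delta / C_1$ would no longer follow.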